Purpose: Help you choose the right RAG pipeline for your use case
Audience: Developers building RAG applications with iris-vector-rag
| Pipeline | Best For | Retrieval Strategy | When to Use |
|---|---|---|---|
| basic | General Q&A | Vector similarity | Getting started, baseline comparisons, simple use cases |
| basic_rerank | High precision | Vector + cross-encoder reranking | Legal/medical domains, accuracy-critical applications |
| crag | Self-correcting | Vector + evaluation + web search | Dynamic knowledge, fact-checking, current events |
| graphrag | Complex relationships | Vector + text + graph + RRF fusion | Research, medical knowledge, entity-heavy domains |
| multi_query_rrf | Robust recall | Multiple query variations + RRF | Ambiguous queries, exploratory search |
| pylate_colbert | Late interaction | ColBERT token-level matching | Fine-grained relevance, long documents |
Pipeline: basic
How it Works:
- Embed query into vector
- Find top-k most similar document embeddings
- Generate answer from retrieved documents
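The retrieval step above can be sketched in plain numpy. This is an illustrative model of vector similarity search, not iris-vector-rag's internals; `retrieve_top_k` is a made-up helper:

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    return np.argsort(scores)[::-1][:k]  # highest-scoring first

# Toy 3-d embeddings: doc 0 points the same way as the query
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
print(retrieve_top_k(query, docs, k=2))  # → [0 2]
```

The retrieved documents are then concatenated into the LLM prompt for the generation step.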
Strengths:
- Fast and lightweight
- Low resource requirements
- Good baseline performance
- Easy to understand and debug
Weaknesses:
- No reranking (lower precision than basic_rerank)
- No self-correction (can return irrelevant results)
- No knowledge graph reasoning
Configuration:

```python
from iris_vector_rag import create_pipeline

pipeline = create_pipeline(
    'basic',
    validate_requirements=True,
    auto_setup=False
)
result = pipeline.query("What is diabetes?", top_k=5)
```

Performance:
- Retrieval time: 100-300ms
- Memory: ~500MB (model + index)
- Accuracy: 70-80% (domain-dependent)
Best Use Cases:
- Getting started with RAG
- Proof-of-concept applications
- Well-scoped Q&A with high-quality documents
- Performance-critical applications where speed > accuracy
Pipeline: basic_rerank
How it Works:
- Embed query and retrieve top-k candidates (k=20-50)
- Rerank candidates using cross-encoder model
- Select top-n most relevant (n=5)
- Generate answer from reranked documents
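The retrieve-then-rerank pattern above can be sketched as follows. This is a hand-rolled illustration, not the library's implementation; `overlap_score` is a toy stand-in for a real cross-encoder model:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-order vector-search candidates using a relevance scorer."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Toy scorer: word overlap instead of a real cross-encoder
def overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = ["diabetes is a chronic disease", "the weather is nice", "what is diabetes"]
print(rerank("what is diabetes", candidates, overlap_score, top_n=2))
```

In practice the scorer would be a cross-encoder that reads the query and each candidate jointly, which is what makes this step slower but more precise than pure vector similarity.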
Strengths:
- Higher precision than basic (10-15% accuracy improvement)
- Better handling of semantic nuances
- Filters out false positives from vector search
Weaknesses:
- Slower than basic (2-3x overhead from reranking)
- Higher resource requirements (2 models: embedding + cross-encoder)
Configuration:

```python
pipeline = create_pipeline(
    'basic_rerank',
    validate_requirements=True
)
# Retrieves 50 candidates, reranks to top 5
result = pipeline.query("What is diabetes?", top_k=5, candidate_k=50)
```

Performance:
- Retrieval time: 300-600ms
- Memory: ~1.5GB (embedding model + cross-encoder)
- Accuracy: 80-90%
Best Use Cases:
- Legal research (precision matters)
- Medical Q&A (accuracy-critical)
- Customer support (fewer wrong answers)
- Enterprise search (user expectations high)
Comparison with basic:

```python
# Test on a medical Q&A dataset (illustrative numbers)
basic_accuracy = 0.75   # 75% correct answers
rerank_accuracy = 0.88  # 88% correct answers
# Improvement: +13 percentage points (~17% relative)
# Cost: 2.5x slower, 3x more memory
```

Pipeline: crag

How it Works:
- Retrieve documents via vector search
- Evaluate retrieved documents for relevance (LLM self-check)
- If irrelevant: Trigger web search fallback to find better sources
- Generate answer from corrected document set
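The corrective loop above can be sketched as a plain function. Everything here is a toy stand-in (the lambdas simulate the retriever, the LLM grader, and the web search), shown only to make the control flow concrete:

```python
def corrective_retrieve(query, retrieve_fn, grade_fn, web_search_fn, threshold=0.7):
    """CRAG-style loop: keep documents the grader accepts, fall back to web search otherwise."""
    docs = retrieve_fn(query)
    graded = [(grade_fn(query, d), d) for d in docs]
    relevant = [d for score, d in graded if score >= threshold]
    if not relevant:  # retrieval judged irrelevant: correct via web search
        relevant = web_search_fn(query)
    return relevant

# Toy stand-ins: internal retrieval returns only off-topic documents
retrieved = corrective_retrieve(
    "latest diabetes treatment",
    retrieve_fn=lambda q: ["ancient history of medicine"],
    grade_fn=lambda q, d: 0.1,  # grader rejects everything
    web_search_fn=lambda q: ["2024 review of GLP-1 therapies"],
)
print(retrieved)  # → ['2024 review of GLP-1 therapies']
```

The grading step is why crag costs extra LLM calls: each retrieved document (or batch) is judged for relevance before generation.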
Strengths:
- Self-correcting (detects and fixes poor retrieval)
- Adapts to knowledge gaps (web search fallback)
- Robust to outdated internal documents
Weaknesses:
- Slower than basic (3-5x due to evaluation + web search)
- Requires web search API (e.g., Tavily, Bing)
- Higher LLM costs (evaluation step uses LLM calls)
Configuration:

```python
pipeline = create_pipeline(
    'crag',
    validate_requirements=True,
    web_search_api_key="your-tavily-api-key"  # Required for fallback
)
result = pipeline.query("What is the latest diabetes treatment?", top_k=5)
# → If internal docs are outdated, triggers web search
```

Performance:
- Retrieval time: 500-1500ms (depending on web search)
- Memory: ~800MB
- Accuracy: 85-95% (higher than basic due to self-correction)
Best Use Cases:
- Current events ("What happened today?")
- Fact-checking and verification
- Dynamic knowledge domains (news, research)
- Hybrid internal + external knowledge
When to Use Web Fallback:
- Internal documents may be outdated
- User queries reference recent events
- Knowledge base has gaps
- External sources needed for validation
Pipeline: graphrag
How it Works:
- Vector search: Semantic similarity
- Text search: Keyword/BM25 matching
- Graph traversal: Follow entity relationships in knowledge graph
- RRF fusion: Combine results from all 3 methods
- Generate answer from fused results
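The RRF fusion step above can be sketched in a few lines. This is the standard Reciprocal Rank Fusion formula (score = Σ 1/(k + rank) over the ranked lists), not iris-vector-rag's exact code:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Ranked results from the three retrieval methods
vector_hits = ["d1", "d2", "d3"]
text_hits   = ["d2", "d4"]
graph_hits  = ["d2", "d1"]
print(rrf_fuse([vector_hits, text_hits, graph_hits]))  # d2 first: ranked high in all three
```

Because RRF only needs ranks, not comparable scores, it can merge semantic, keyword, and graph results without any score normalization.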
Strengths:
- Best for complex entity relationships
- Captures connections vector search misses
- Hybrid retrieval (semantic + lexical + graph)
- Superior performance on knowledge-intensive tasks
Weaknesses:
- Requires knowledge graph construction (entity extraction)
- Slowest pipeline (3 retrieval methods + fusion)
- Higher memory requirements (vector + text + graph indices)
- Requires iris-vector-graph library
Configuration:

```python
pipeline = create_pipeline(
    'graphrag',
    validate_requirements=True
)
# Automatically uses:
# - Vector index for semantic search
# - Text index for keyword search
# - Knowledge graph for entity-based retrieval
result = pipeline.query("What medications interact with metformin?", top_k=5)
```

Performance:
- Retrieval time: 800-2000ms
- Memory: ~2GB (all indices)
- Accuracy: 90-95% (best for entity-heavy queries)
Best Use Cases:
- Medical knowledge (drug interactions, disease relationships)
- Research papers (author networks, citation graphs)
- Legal case law (precedent relationships)
- Enterprise knowledge bases (org charts, product hierarchies)
Requirements:

```bash
# Install GraphRAG dependencies
pip install iris-vector-rag[hybrid-graphrag]
```

Knowledge Graph Construction:

```python
from iris_vector_rag.pipelines.graphrag import HybridGraphRAGPipeline

# Entities and relationships are extracted during document loading
pipeline.load_documents(documents=docs)
# → Builds knowledge graph: entities + relationships
```

Pipeline: multi_query_rrf

How it Works:
- Generate 3-5 variations of user query (using LLM)
- Retrieve documents for each query variation
- Fuse results using Reciprocal Rank Fusion (RRF)
- Generate answer from fused results
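The flow above can be sketched as follows. For brevity this sketch merges results by each document's best rank across variations rather than full RRF, and the `results` dict stands in for both the LLM query-variation step and the retriever:

```python
def multi_query_retrieve(variations, retrieve_fn, top_k=5):
    """Retrieve per query variation, then merge by each document's best (lowest) rank."""
    best_rank = {}
    for q in variations:
        for rank, doc in enumerate(retrieve_fn(q)):
            best_rank[doc] = min(rank, best_rank.get(doc, rank))
    return sorted(best_rank, key=best_rank.get)[:top_k]

# Toy retriever: different phrasings surface overlapping but distinct documents
results = {
    "What are the treatment options for diabetes?": ["d1", "d2"],
    "How is diabetes managed and treated?": ["d2", "d3"],
    "What medications are used for diabetes?": ["d4", "d1"],
}
print(multi_query_retrieve(list(results), results.get, top_k=3))
```

The key effect is visible even in the toy data: documents missed by one phrasing (d3, d4) are still surfaced by another.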
Strengths:
- Robust to query phrasing
- Higher recall (finds more relevant docs via variations)
- Handles ambiguous queries better
Weaknesses:
- Higher LLM costs (query generation step)
- Slower than basic (3-5x due to multiple retrievals)
- May retrieve redundant documents
Configuration:

```python
pipeline = create_pipeline(
    'multi_query_rrf',
    validate_requirements=True,
    num_query_variations=3  # Generate 3 variations
)
result = pipeline.query("diabetes treatment", top_k=5)
# Query variations might be:
# - "What are the treatment options for diabetes?"
# - "How is diabetes managed and treated?"
# - "What medications are used for diabetes?"
```

Performance:
- Retrieval time: 600-1200ms
- Memory: ~500MB
- Accuracy: 82-88% (better recall than basic)
Best Use Cases:
- Ambiguous user queries
- Exploratory search ("Tell me about X")
- Diverse document collections
- When users may phrase questions poorly
Pipeline: pylate_colbert
How it Works:
- Encode query into multiple token-level embeddings
- Encode documents into token-level embeddings
- Compute late interaction (max-sim over token pairs)
- Retrieve documents with highest late interaction scores
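The late-interaction (MaxSim) scoring above can be sketched with numpy. The tiny 2-d "token embeddings" are made up for illustration; real ColBERT embeddings come from a trained model:

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """ColBERT-style late interaction: sum over query tokens of the best-matching doc token."""
    sim = query_tokens @ doc_tokens.T  # all query-token x doc-token dot products
    return sim.max(axis=1).sum()       # max over doc tokens, summed over query tokens

# Toy token embeddings (2 query tokens; each doc has 3 tokens)
q = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])  # matches only the first
print(maxsim(q, doc_a), maxsim(q, doc_b))  # doc_a scores higher
```

Because every query token must find its own best match, a document that covers all query terms beats one that repeats a single term, which is the fine-grained behavior single-vector search cannot express.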
Strengths:
- Fine-grained relevance (token-level matching)
- Better than vector for long documents
- Captures term importance
Weaknesses:
- Slower than basic vector search
- Higher storage requirements (multiple embeddings per document)
- Requires specialized index structure
Configuration:

```python
pipeline = create_pipeline(
    'pylate_colbert',
    validate_requirements=True
)
result = pipeline.query("What are the side effects of metformin?", top_k=5)
```

Performance:
- Retrieval time: 400-800ms
- Memory: ~1.2GB
- Accuracy: 85-92% (better than basic for complex queries)
Best Use Cases:
- Long documents (research papers, legal documents)
- Queries with specific terms that must match
- When document structure matters
- Fine-grained relevance requirements
```text
Need RAG pipeline?
│
├─ Just getting started? → basic
│
├─ High accuracy required?
│  ├─ Yes, and speed is OK? → basic_rerank
│  └─ Yes, and need entity reasoning? → graphrag
│
├─ Knowledge may be outdated?
│  └─ Yes, need external sources? → crag
│
├─ Users phrase queries poorly?
│  └─ Yes, need query robustness? → multi_query_rrf
│
└─ Long documents with specific terms?
   └─ Yes, need token-level matching? → pylate_colbert
```
| Pipeline | Retrieval Time | Memory | Accuracy | Cost (LLM calls) |
|---|---|---|---|---|
| basic | 100-300ms | 500MB | 70-80% | 1 (generation only) |
| basic_rerank | 300-600ms | 1.5GB | 80-90% | 1 (generation only) |
| crag | 500-1500ms | 800MB | 85-95% | 2-3 (eval + generation) |
| graphrag | 800-2000ms | 2GB | 90-95% | 1 (generation only) |
| multi_query_rrf | 600-1200ms | 500MB | 82-88% | 2-4 (query gen + generation) |
| pylate_colbert | 400-800ms | 1.2GB | 85-92% | 1 (generation only) |
| Domain | Recommended Pipeline | Accuracy Gain vs basic |
|---|---|---|
| Medical Q&A | graphrag | +20% (entity relationships) |
| Legal Research | basic_rerank | +15% (precision matters) |
| Current Events | crag | +18% (web fallback) |
| Research Papers | graphrag | +22% (citation graphs) |
| Customer Support | basic_rerank | +12% (accuracy > speed) |
| General Q&A | basic | baseline |
All pipelines share the same API, making it easy to experiment:

```python
from iris_vector_rag import create_pipeline

# Start with basic
pipeline = create_pipeline('basic')
result_basic = pipeline.query("What is diabetes?", top_k=5)

# Upgrade to basic_rerank for better accuracy
pipeline = create_pipeline('basic_rerank')
result_rerank = pipeline.query("What is diabetes?", top_k=5)

# Try graphrag for entity reasoning
pipeline = create_pipeline('graphrag')
result_graph = pipeline.query("What is diabetes?", top_k=5)

# Compare results
print(f"Basic answer: {result_basic['answer']}")
print(f"Rerank answer: {result_rerank['answer']}")
print(f"Graph answer: {result_graph['answer']}")
```

| Pipeline | Setup Complexity | Learning Curve | Dev Time |
|---|---|---|---|
| basic | Low | 1 hour | 1 day |
| basic_rerank | Low | 2 hours | 1 day |
| crag | Medium | 4 hours | 2 days |
| graphrag | High | 8 hours | 3-5 days |
| multi_query_rrf | Medium | 3 hours | 2 days |
| pylate_colbert | Medium | 3 hours | 2 days |
| Pipeline | Compute | LLM API | Storage | Total/Month |
|---|---|---|---|---|
| basic | $5 | $20 | $2 | $27 |
| basic_rerank | $15 | $20 | $5 | $40 |
| crag | $10 | $60 | $2 | $72 |
| graphrag | $25 | $20 | $10 | $55 |
| multi_query_rrf | $12 | $80 | $2 | $94 |
| pylate_colbert | $18 | $20 | $8 | $46 |
Assumptions: AWS t3.large instance, OpenAI gpt-3.5-turbo, 10k documents
basic → basic_rerank:

Effort: Low (1-line change)
Benefit: +10-15% accuracy

```python
# Before
pipeline = create_pipeline('basic')
# After
pipeline = create_pipeline('basic_rerank')
```

basic → graphrag:

Effort: High (requires graph construction)
Benefit: +20-25% accuracy on entity-heavy queries

```python
# Before
pipeline = create_pipeline('basic')
# After
pipeline = create_pipeline('graphrag')
# Note: requires entity extraction and graph building
```

basic → crag:

Effort: Medium (requires a web search API key)
Benefit: +15-20% accuracy on dynamic knowledge

```python
# Before
pipeline = create_pipeline('basic')
# After
pipeline = create_pipeline('crag', web_search_api_key="your-key")
```

Recommended Configurations:

```python
import os

from iris_vector_rag import create_pipeline

# basic
pipeline = create_pipeline(
    'basic',
    validate_requirements=True,
    auto_setup=False
)

# basic_rerank
pipeline = create_pipeline(
    'basic_rerank',
    validate_requirements=True,
    candidate_k=50,  # Retrieve 50 candidates, rerank to top_k
)

# crag
pipeline = create_pipeline(
    'crag',
    validate_requirements=True,
    web_search_api_key=os.getenv("TAVILY_API_KEY"),
    relevance_threshold=0.7,  # Trigger web search if relevance falls below threshold
)

# graphrag
pipeline = create_pipeline(
    'graphrag',
    validate_requirements=True,
    entity_extraction_model="en_core_web_sm",
    enable_hybrid_search=True,  # Vector + text + graph
)
```

Not sure which pipeline to choose?
→ Start with basic, then upgrade to basic_rerank if accuracy is insufficient.
Accuracy too low?
- Try basic_rerank (+10-15% accuracy)
- If queries are entity-heavy, try graphrag (+20% for entities)
- If knowledge is outdated, try crag (web fallback)

Retrieval too slow or too expensive?
- Use basic instead of graphrag (3-5x faster)
- Reduce top_k (fewer documents to process)
- Use a smaller LLM (gpt-3.5-turbo instead of gpt-4)

Users phrase queries poorly?
→ Use multi_query_rrf (generates query variations automatically)

Need fine-grained matching in long documents?
→ Use pylate_colbert (token-level matching)
- User Guide - Complete iris-vector-rag usage guide
- API Reference - Full API documentation
- IRIS EMBEDDING Guide - Auto-vectorization setup
- Performance Tuning - Optimization best practices