Retrieval Pipeline

How the RAG system finds relevant content for chat queries.

Pipeline Overview

User Query → Query Processing → Hybrid Search → Ranking → Context → Response
                ↓                                    ↓
         Query Expansion                    Reciprocal Rank Fusion
                ↓                                    ↓
         Embed Generation                   Cross-Encoder Rerank
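The stages in the diagram can be sketched as one orchestration function. Every stage here is an injected callable and every name is illustrative, not the system's actual module API:

```python
def run_pipeline(query, *, process, embed, vector_search, keyword_search,
                 fuse, rerank, assemble, generate):
    """One pass through the pipeline; each stage is an injected callable."""
    q = process(query)                      # 1. query processing
    qvec = embed(q)                         # 2. embedding generation
    vec_hits = vector_search(qvec)          # 3a. pgvector search
    kw_hits = keyword_search(q)             # 3b. TSVECTOR search
    candidates = fuse([vec_hits, kw_hits])  # 4. reciprocal rank fusion
    top = rerank(q, candidates)             # 5. cross-encoder rerank
    context = assemble(top)                 # 6. context assembly
    return generate(context, query)         # LLM response
```

Keeping each stage behind a plain callable makes the individual steps easy to swap or test in isolation.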

1. Query Processing

  • Clean and normalize input
  • Extract key terms for keyword search
  • Generate query variations for better recall
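A minimal sketch of the three bullets above; the stop-word list and cleanup rules are illustrative stand-ins, not the system's actual ones:

```python
import re

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "for", "in"}

def process_query(raw: str) -> dict:
    """Clean the input, extract key terms, and build recall variations."""
    cleaned = re.sub(r"\s+", " ", raw).strip().lower()
    terms = [t for t in re.findall(r"[a-z0-9]+", cleaned)
             if t not in STOP_WORDS]
    # Variations for better recall: the full query plus a key-terms-only form.
    variations = [cleaned]
    if terms and " ".join(terms) != cleaned:
        variations.append(" ".join(terms))
    return {"cleaned": cleaned, "key_terms": terms, "variations": variations}
```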

2. Embedding Generation

  • Model: Azure OpenAI text-embedding-3-small
  • Dimensions: 1536

Caching Strategy

Redis (hot) → PostgreSQL (cold) → API
   1 hour         30 days
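One way to read the tiers: check Redis, fall back to Postgres, and only then call the embedding API, promoting the result back into the hot tier. The client objects and their methods below are stand-ins, not the project's actual interfaces:

```python
import hashlib
import json

def cache_key(text: str) -> str:
    # Hash the query so keys stay a fixed size regardless of query length.
    return "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_embedding(text, redis, pg, api, hot_ttl_s=3600):
    """Redis (hot, ~1 h) -> PostgreSQL (cold, ~30 d) -> embedding API."""
    key = cache_key(text)
    hit = redis.get(key)
    if hit is not None:
        return json.loads(hit)
    vec = pg.get(key)          # cold tier; the 30-day expiry runs elsewhere
    if vec is None:
        vec = api.embed(text)  # e.g. Azure OpenAI text-embedding-3-small
        pg.put(key, vec)
    redis.setex(key, hot_ttl_s, json.dumps(vec))  # promote to hot tier
    return vec
```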

3. Hybrid Search

Vector Search (pgvector)

Semantic similarity search over embeddings, using an HNSW index for approximate nearest neighbors.

Keyword Search (Postgres TSVECTOR)

Full-text search for exact term matches (IDs, acronyms, code identifiers) that embeddings can miss.

Why Hybrid?

| Search Type | Strength            | Weakness                   |
|-------------|---------------------|----------------------------|
| Vector      | Semantic similarity | May miss exact terms       |
| Keyword     | Exact matches       | No semantic understanding  |
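Each leg can be a single SQL query. The sketch below assumes a `chunks(id, content, embedding, tsv)` table; the schema and column names are assumptions. pgvector's `<=>` operator is cosine distance, so `1 - distance` serves as a similarity score:

```python
# Vector leg: approximate nearest neighbors served by the HNSW index.
VECTOR_SQL = """
SELECT id, 1 - (embedding <=> %(qvec)s::vector) AS score
FROM chunks
ORDER BY embedding <=> %(qvec)s::vector
LIMIT %(k)s;
"""

# Keyword leg: full-text match against a precomputed TSVECTOR column.
KEYWORD_SQL = """
SELECT id, ts_rank(tsv, plainto_tsquery('english', %(q)s)) AS score
FROM chunks
WHERE tsv @@ plainto_tsquery('english', %(q)s)
ORDER BY score DESC
LIMIT %(k)s;
"""

# The index the vector leg relies on, built once at setup time.
HNSW_INDEX_SQL = (
    "CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);"
)
```

Both queries return `(id, score)` rows, so the fusion step downstream can treat the two result lists uniformly.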

4. Score Fusion

Reciprocal Rank Fusion (RRF) combines the two rankings. Each document's combined score is the sum, over the result lists it appears in, of:

score = 1 / (k + rank)

where rank is the document's 1-based position in that list and k is a smoothing constant.
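A direct implementation of the formula; k = 60 is the value from the original RRF paper, since the document doesn't specify the constant used here:

```python
def rrf(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because scores accumulate across lists, a document that both search legs rank moderately well can beat one that only a single leg ranks highly.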

5. Reranking

Cross-encoder reranking for better relevance:

  • Model: BAAI/bge-reranker-base
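A cross-encoder scores each (query, passage) pair jointly instead of comparing precomputed vectors, which is slower but more accurate. A sketch with the scorer injected; in production the scorer could be `sentence_transformers.CrossEncoder("BAAI/bge-reranker-base").predict`, though that wiring is an assumption:

```python
def rerank(query, docs, score_pairs, top_n=5):
    """Score (query, doc) pairs with a cross-encoder and keep the best top_n."""
    scores = score_pairs([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```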

6. Context Assembly

Token budget allocation:

| Component     | Budget (tokens) |
|---------------|-----------------|
| System prompt | 1,000           |
| History       | 8,000           |
| Context       | ~110,000        |
| Response      | 4,000           |
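Assembly fills the context budget greedily in relevance order. A sketch with a pluggable token counter; a real implementation would use the model's tokenizer (e.g. tiktoken), and the whitespace count below is only a stand-in:

```python
# Budget figures from the allocation table above.
BUDGET = {"system": 1_000, "history": 8_000, "context": 110_000,
          "response": 4_000}

def assemble_context(chunks, budget=BUDGET["context"],
                     count_tokens=lambda text: len(text.split())):
    """Take reranked chunks in order until the next would exceed the budget."""
    picked, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        picked.append(chunk)
        used += n
    return "\n\n".join(picked)
```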

Performance Targets

| Metric         | Target  | Alert    |
|----------------|---------|----------|
| Total Latency  | < 500ms | > 1000ms |
| Vector Search  | < 100ms | > 200ms  |
| Cache Hit Rate | > 70%   | < 50%    |