RAG Scenarios and Solutions
Similarity Score Calibration
Raw similarity scores (nominally 0.0 to 1.0) do not map directly to relevance: a score of 0.75 might be excellent for one query but poor for another, so a single fixed threshold is unreliable.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
The Problem
The same similarity value can mark an excellent match for one query and a poor match for another, so no single cutoff separates relevant from irrelevant results across all queries.
Symptoms
- ❌ Can't set universal relevance threshold
- ❌ Same score means different relevance per query
- ❌ "0.80 similarity" - is this good or bad?
- ❌ Generic docs always score high
- ❌ Specific docs score low despite being perfect matches
Real-World Example
Query A: "API authentication"
Top result: "API Guide" (score: 0.92) ← Excellent match
Query B: "Configure TPS-2000 subsystem"
Top result: "System Configuration" (score: 0.68) ← Also excellent match!
Same threshold (0.75) would:
→ Accept Query A result ✓
→ Reject Query B result ✗ (below 0.75)
But Query B's 0.68 is actually the best possible match
→ Specific technical query
→ Limited vocabulary overlap
→ Lower scores expected
Threshold needs calibration per query type
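A minimal sketch of the failure above, using the two example results and the 0.75 cutoff from this section. The function name `accepted` is illustrative, not part of any library.

```python
FIXED_THRESHOLD = 0.75

# (query, top document, similarity score) taken from the example above
top_results = [
    ("API authentication", "API Guide", 0.92),
    ("Configure TPS-2000 subsystem", "System Configuration", 0.68),
]


def accepted(score: float, threshold: float = FIXED_THRESHOLD) -> bool:
    """Return True when a score clears a fixed, query-independent threshold."""
    return score >= threshold


decisions = {query: accepted(score) for query, _doc, score in top_results}
for query, keep in decisions.items():
    print(f"{query!r}: {'accept' if keep else 'reject'}")
```

The second query is rejected even though 0.68 is its best available match, which is the behavior a per-query calibration is meant to avoid.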
Deep Technical Analysis
Cosine Similarity Range Compression
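Embedding similarity rarely spans the full nominal range:
Typical Score Distribution:
Nominal range: 0.0 to 1.0 (raw cosine similarity can range from -1.0 to 1.0, but text embeddings rarely produce negative values)
Typically observed range: roughly 0.40 to 0.95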
Why compressed?
→ Embeddings trained to be "similar" to related concepts
→ Dissimilar concepts still share some dimensions
→ Rarely see true 0.0 or 1.0
Practical implications:
→ 0.70 isn't "70% similar"
→ More like "moderately related"
→ 0.90 is "highly related"
→ Non-linear perception
The Dead Zone:
Scores 0.0 - 0.30: Almost never seen
→ Would require completely unrelated concepts
→ Even "Database" vs "Banana" might score around 0.35
→ Unrelated texts still share generic language features, so the dot product stays positive
Usable range: 0.40 - 0.95
→ About 55% of the nominal 0-1 range
→ Makes discrimination harder
→ Small score differences matter more
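To make the compression concrete, here is a small sketch of cosine similarity together with a hypothetical helper that stretches the observed band back onto 0 to 1. The default bounds (0.40 and 0.95) are the illustrative figures from this section and would need to be measured for a real model and corpus.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; ranges from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rescale_to_observed_range(score: float, low: float = 0.40, high: float = 0.95) -> float:
    """Map a raw score from the observed band [low, high] onto [0, 1].

    Scores outside the band are clipped. The bounds are assumptions.
    """
    clipped = min(max(score, low), high)
    return (clipped - low) / (high - low)


print(rescale_to_observed_range(0.70))  # raw 0.70 sits a little past the middle of the band
```

Rescaling only changes how a score reads; it does not make scores comparable across queries, which the later calibration techniques address.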
Query-Dependent Score Distributions
Different queries have different score patterns:
Broad vs Narrow Queries:
Broad query: "software"
→ Many documents match somewhat
→ Top score: 0.88
→ 10th score: 0.82
→ Tight distribution (0.06 spread)
→ Hard to distinguish quality
Narrow query: "JWT token expiration configuration"
→ Few documents match
→ Top score: 0.75
→ 10th score: 0.45
→ Wide distribution (0.30 spread)
→ Clear quality differences
Same threshold doesn't work for both
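One way to detect which situation a query is in is to measure the spread of its top-k scores. The sketch below is illustrative; the score lists are made up to match the top and 10th scores quoted above.

```python
from typing import Sequence


def top_k_spread(scores: Sequence[float], k: int = 10) -> float:
    """Difference between the best score and the k-th best score."""
    ranked = sorted(scores, reverse=True)[:k]
    if not ranked:
        raise ValueError("need at least one score")
    return ranked[0] - ranked[-1]


# Hypothetical score lists consistent with the figures in this section
broad = [0.88, 0.87, 0.86, 0.86, 0.85, 0.84, 0.84, 0.83, 0.83, 0.82]
narrow = [0.75, 0.70, 0.66, 0.61, 0.58, 0.55, 0.52, 0.49, 0.47, 0.45]

print(f"broad spread:  {top_k_spread(broad):.2f}")
print(f"narrow spread: {top_k_spread(narrow):.2f}")
```

A small spread suggests that rank order carries little information and a relative cutoff will be noisy; a large spread suggests the top results are clearly separated from the rest.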
Vocabulary Overlap Effect:
Query with common words: "how to start"
→ "how", "to", "start" appear everywhere
→ High baseline similarity
→ Even irrelevant docs score 0.70+
→ True matches: 0.85+
Query with rare terms: "configure fsync durability"
→ "fsync", "durability" are rare
→ Most docs score 0.50
→ True matches: 0.68+
Absolute scores incomparable across queries
Document-Specific Baseline Scores
Some documents score high regardless of query:
Generic "Hub" Documents:
Document: "Getting Started Guide"
→ Covers many topics broadly
→ Contains keywords from many domains
→ Embedding is "central" in vector space
Matches many queries with scores 0.75-0.85
→ Not because it's relevant
→ Because it's generic
Specific document: "Advanced Kafka Partitioning"
→ Narrow focus
→ Embedding is "peripheral" in vector space
→ Rarely matches above 0.70
→ Even for perfect Kafka queries!
Need document-specific score normalization
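A simple form of document-specific normalization is to subtract each document's average score against a sample of background queries. This is a sketch under assumptions: the background scores and the example query scores are invented, and subtracting the mean is only one possible normalization.

```python
from statistics import mean
from typing import Dict, List


def document_baselines(background_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average score of each document against a sample of background queries."""
    return {doc: mean(scores) for doc, scores in background_scores.items() if scores}


def baseline_adjusted(doc_id: str, score: float, baselines: Dict[str, float]) -> float:
    """Score relative to the document's own baseline.

    Documents without a baseline fall back to the mean of all known baselines.
    """
    fallback = mean(list(baselines.values()))
    return score - baselines.get(doc_id, fallback)


# Hypothetical scores logged for unrelated queries
baselines = document_baselines({
    "Getting Started Guide": [0.78, 0.82, 0.80],
    "Advanced Kafka Partitioning": [0.48, 0.52, 0.50],
})

# Hypothetical raw scores for a Kafka-specific query
hub = baseline_adjusted("Getting Started Guide", 0.78, baselines)
specific = baseline_adjusted("Advanced Kafka Partitioning", 0.70, baselines)
print(f"hub: {hub:+.2f}, specific: {specific:+.2f}")
```

After adjustment the specific document outranks the hub document, even though its raw score was lower.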
Length and Density Bias:
Long document (5000 tokens):
→ Covers many sub-topics
→ Higher chance of keyword overlap
→ Tends to score higher
Short document (200 tokens):
→ Focused single concept
→ Limited vocabulary
→ Tends to score lower
"Getting Started" (long) vs "Error Code 503" (short)
→ Query: "503 error"
→ "Getting Started" scores 0.72 (mentions errors)
→ "Error Code 503" scores 0.70 (specific but short)
Length bias in scoring
Model-Specific Score Ranges
Different embedding models have different scales:
Model Comparison (approximate figures; distributions depend on the corpus and should be measured on your own data):
Sentence-BERT (all-mpnet-base-v2):
→ Typical scores: 0.50 - 0.85
→ Mean: 0.67
→ Std dev: 0.12
OpenAI text-embedding-ada-002:
→ Typical scores: 0.60 - 0.92
→ Mean: 0.76
→ Std dev: 0.08
Cohere embed-english-v3.0:
→ Typical scores: 0.45 - 0.88
→ Mean: 0.66
→ Std dev: 0.14
Cannot use the same threshold across models
→ 0.75 is above average for Sentence-BERT
→ 0.75 is slightly below average for OpenAI
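The comparison can be expressed as a per-model z-score. The sketch below reuses the mean and standard deviation figures quoted in this section as placeholder values; in practice they would be estimated from your own queries and corpus.

```python
from typing import Dict, Tuple

# (mean, std dev) per model, copied from the figures in this section (assumed, not measured here)
MODEL_STATS: Dict[str, Tuple[float, float]] = {
    "all-mpnet-base-v2": (0.67, 0.12),
    "text-embedding-ada-002": (0.76, 0.08),
    "embed-english-v3.0": (0.66, 0.14),
}


def model_z(model: str, score: float) -> float:
    """How many standard deviations a score sits above that model's mean."""
    mu, sigma = MODEL_STATS[model]
    return (score - mu) / sigma


def model_threshold(model: str, z: float = 1.0) -> float:
    """Raw-score threshold that corresponds to a given z for one model."""
    mu, sigma = MODEL_STATS[model]
    return mu + z * sigma


for name in MODEL_STATS:
    print(f"{name}: z(0.75) = {model_z(name, 0.75):+.2f}, threshold(z=1) = {model_threshold(name):.2f}")
```

The same raw 0.75 lands above the mean for one model and below it for another, so thresholds are better expressed in z units and converted per model.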
Training Data Effects:
Model trained on Wikipedia + books (formal, diverse text):
→ Tends to produce a wider score range
Model trained on Q&A pairs (focused, similar structure):
→ Tends to produce a narrower score range
Score distribution reflects training data
Calibration Techniques
Methods to normalize scores:
Z-Score Normalization:
Per-query calibration:
1. Retrieve top 100 candidates
2. Compute mean μ and std dev σ of their scores
3. Normalize each score: z = (score - μ) / σ (if σ = 0, every candidate scored the same; treat z as 0)
Interpretation:
→ z = 0: Average relevance for this query
→ z = 2: 2 standard deviations above average (very relevant)
→ z = -1: Below average (likely irrelevant)
Threshold becomes, for example, z > 1.5
→ Adapts to query-specific distribution
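The three steps can be sketched with the standard library. This is a minimal illustration, assuming the candidates arrive as `(doc_id, score)` pairs; the population standard deviation and the zero-variance fallback are choices made here, not requirements of the technique.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def z_normalize(scores: Sequence[float]) -> List[float]:
    """Convert raw scores for one query into z-scores."""
    if len(scores) < 2:
        raise ValueError("need at least two candidates to estimate a spread")
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All candidates scored the same, so none stands out.
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def select_by_z(candidates: Sequence[Tuple[str, float]], z_threshold: float = 1.5) -> List[Tuple[str, float]]:
    """Keep candidates whose z-score exceeds the threshold; returns (doc_id, z)."""
    zs = z_normalize([score for _doc, score in candidates])
    return [(doc, z) for (doc, _score), z in zip(candidates, zs) if z > z_threshold]


# Hypothetical candidates: one clear match among near-identical background scores
candidates = [("doc0", 0.75)] + [(f"doc{i}", 0.50) for i in range(1, 10)]
selected = select_by_z(candidates)
print(selected)
```

Here a raw 0.75 would fail a fixed 0.80 cutoff, yet it stands three standard deviations above the other candidates for this query and is kept.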
Percentile-Based Cutoff:
Instead of an absolute threshold:
→ "Return results at or above the 90th percentile of the candidate scores"
Query A: 90th percentile = 0.87
Query B: 90th percentile = 0.72
Adaptive threshold
→ Accounts for query difficulty
→ Always returns "best" results relatively
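A percentile cutoff can be computed per query from the candidate scores. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; the function names and sample scores are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def percentile_cutoff(scores: Sequence[float], percentile: float = 90.0) -> float:
    """Nearest-rank percentile of the scores for a single query."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ranked = sorted(scores)
    rank = math.ceil(percentile * len(ranked) / 100)
    return ranked[rank - 1]


def top_percentile(candidates: Sequence[Tuple[str, float]], percentile: float = 90.0) -> List[Tuple[str, float]]:
    """Keep candidates scoring at or above the per-query percentile cutoff."""
    cutoff = percentile_cutoff([score for _doc, score in candidates], percentile)
    return [(doc, score) for doc, score in candidates if score >= cutoff]


# Hypothetical candidates for one query
candidates = [(f"doc{i}", s) for i, s in enumerate(
    [0.72, 0.70, 0.66, 0.61, 0.58, 0.55, 0.52, 0.49, 0.47, 0.45]
)]
kept = top_percentile(candidates, percentile=90.0)
print(kept)
```

Because the cutoff is relative, it always returns something, including for queries where nothing in the corpus is actually relevant; pairing it with a minimum absolute score is a common safeguard.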
BM25 Hybrid Scoring:
Combine semantic similarity with keyword match:
→ Semantic score: 0.75
→ BM25 score: 12.3 (different scale)
Normalized combination:
→ norm_semantic = (semantic - μ_semantic) / σ_semantic
→ norm_bm25 = (bm25 - μ_bm25) / σ_bm25
→ final = 0.7 × norm_semantic + 0.3 × norm_bm25
Both scores on same scale
→ Meaningful combination
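The normalized combination above can be sketched as follows. The BM25 values are assumed to come from an existing keyword index; only the normalization and weighting are shown, and the sample numbers are invented.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize values to zero mean and unit variance (zeros if constant)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def hybrid_scores(
    semantic: Sequence[float],
    bm25: Sequence[float],
    w_semantic: float = 0.7,
    w_bm25: float = 0.3,
) -> List[float]:
    """Weighted sum of per-query z-normalized semantic and BM25 scores."""
    if len(semantic) != len(bm25):
        raise ValueError("both score lists must cover the same candidates")
    zs, zb = z_scores(semantic), z_scores(bm25)
    return [w_semantic * a + w_bm25 * b for a, b in zip(zs, zb)]


# Hypothetical scores for three candidates of one query
semantic = [0.75, 0.70, 0.60]
bm25 = [12.3, 3.1, 8.0]
final = hybrid_scores(semantic, bm25)
best = max(range(len(final)), key=final.__getitem__)
print(best, [round(f, 3) for f in final])
```

The 0.7 and 0.3 weights are the ones quoted in this section; they are a starting point to tune, not a recommendation.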
Learning-to-Rank Approaches
ML-based score calibration:
Supervised Calibration:
Collect training data:
→ (query, document, score) → relevance label (0-4)
Examples:
→ ("API auth", "OAuth Guide", 0.82) → 4 (highly relevant)
→ ("API auth", "Pricing Page", 0.68) → 1 (low relevance)
Train model:
→ Input: [query_emb, doc_emb, cosine_score]
→ Output: Calibrated score (0-1)
Learns:
→ When 0.68 is actually good (narrow query)
→ When 0.82 is actually mediocre (generic doc)
Requires labeled data
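As a rough illustration of supervised calibration, the sketch below fits a small logistic regression (in the spirit of Platt scaling) with plain gradient descent. It is a simplification of the setup described above: instead of raw embeddings, it uses two assumed features, the cosine score and the score's margin over the query's mean score, and the labelled rows are invented.

```python
import math
from typing import List, Sequence


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Calibrated relevance probability; weights[0] is the bias term."""
    bias, *coefs = weights
    return sigmoid(bias + sum(c * x for c, x in zip(coefs, features)))


def log_loss(weights, rows, labels) -> float:
    """Mean binary cross-entropy of the model on the labelled rows."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict(weights, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def train_calibrator(rows, labels, lr: float = 1.0, epochs: int = 2000) -> List[float]:
    """Fit logistic regression by batch gradient descent, starting from zeros."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical labelled data: (cosine_score, score minus the query's mean score)
rows = [(0.68, 0.18), (0.70, 0.20), (0.90, 0.15), (0.82, 0.02), (0.80, 0.00), (0.55, 0.03)]
grades = [4, 3, 4, 1, 1, 0]                    # graded relevance labels, 0-4
labels = [1 if g >= 3 else 0 for g in grades]  # treat 3+ as relevant

weights = train_calibrator(rows, labels)
calibrated = [predict(weights, x) for x in rows]
print([round(p, 2) for p in calibrated])
```

A production setup would more likely use a gradient-boosted ranker or a library implementation with held-out evaluation; the point here is only that the model sees context features alongside the raw score.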
Implicit Feedback Calibration:
Use click-through data:
→ Log user queries, the results shown, and which results were clicked
Learn:
→ (query, doc A, score=0.75) → clicked 80% of the time
→ (query, doc B, score=0.85) → clicked 60% of the time
Calibrate:
→ For this query, doc A's 0.75 indicates higher relevance than doc B's 0.85
→ Adjust the scoring function accordingly
Automatic, no manual labels
→ But requires traffic
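A minimal sketch of click-based calibration, assuming impressions and clicks are logged per (query, document) pair. The class name, the additive smoothing prior, and the 50/50 blend with the similarity score are all illustrative choices.

```python
from collections import defaultdict


class ClickCalibrator:
    """Tracks smoothed click-through rates and blends them with similarity."""

    def __init__(self, prior_clicks: float = 1.0, prior_impressions: float = 2.0):
        # The prior keeps rarely shown documents near a neutral 0.5 rate.
        self.prior_clicks = prior_clicks
        self.prior_impressions = prior_impressions
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def record(self, query: str, doc_id: str, clicked: bool) -> None:
        key = (query, doc_id)
        self.impressions[key] += 1
        if clicked:
            self.clicks[key] += 1

    def click_rate(self, query: str, doc_id: str) -> float:
        key = (query, doc_id)
        clicks = self.clicks.get(key, 0) + self.prior_clicks
        shown = self.impressions.get(key, 0) + self.prior_impressions
        return clicks / shown

    def adjusted_score(self, query: str, doc_id: str, similarity: float, weight: float = 0.5) -> float:
        """Blend raw similarity with the observed click-through rate."""
        return (1 - weight) * similarity + weight * self.click_rate(query, doc_id)


calibrator = ClickCalibrator()
for i in range(10):
    calibrator.record("api auth", "doc_a", clicked=i < 8)  # clicked 8 of 10 times
    calibrator.record("api auth", "doc_b", clicked=i < 6)  # clicked 6 of 10 times

score_a = calibrator.adjusted_score("api auth", "doc_a", similarity=0.75)
score_b = calibrator.adjusted_score("api auth", "doc_b", similarity=0.85)
print(round(score_a, 3), round(score_b, 3))
```

Click data carries position bias (higher-ranked results get more clicks regardless of quality), which this sketch does not correct for.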
Context-Dependent Thresholding
Adaptive thresholds based on context:
Query Type Detection:
Classify query:
→ Factual: "What is X?" → Threshold 0.80
→ Navigational: "Find X guide" → Threshold 0.70
→ Exploratory: "Learn about X" → Threshold 0.65
Different query types need different confidence levels
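The mapping from query type to threshold can be sketched with a rule-based classifier. The prefix rules below are deliberately naive placeholders; a real system would more likely use a trained classifier. The thresholds are the ones listed above.

```python
THRESHOLDS = {
    "factual": 0.80,
    "navigational": 0.70,
    "exploratory": 0.65,
}

FACTUAL_PREFIXES = ("what is", "what are", "who is", "when did", "how many")
NAVIGATIONAL_PREFIXES = ("find", "where is", "go to", "open")


def classify_query(query: str) -> str:
    """Very rough query-type detection based on how the query starts."""
    q = query.strip().lower()
    if q.startswith(FACTUAL_PREFIXES):
        return "factual"
    if q.startswith(NAVIGATIONAL_PREFIXES):
        return "navigational"
    return "exploratory"


def threshold_for(query: str) -> float:
    """Similarity threshold to apply for this query's detected type."""
    return THRESHOLDS[classify_query(query)]


for q in ("What is X?", "Find X guide", "Learn about X"):
    print(f"{q!r}: {classify_query(q)} -> {threshold_for(q)}")
```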
User Intent Inference:
User history:
→ Clicked 5 specific technical docs
→ Ignored general overviews
Personalize threshold:
→ For this user: Prefer specific docs
→ Boost scores of narrow docs
→ Penalize generic docs
Per-user calibration
Temporal Adjustment:
Query: "latest API updates"
→ Boost recent documents
→ Effective score = base_score × recency_weight
→ Fresh docs score higher
Query: "historical pricing"
→ Don't penalize old docs
→ Use base score as-is
Context-aware score adjustment
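One common choice for `recency_weight` is exponential decay with a half-life. The sketch below assumes that form and a 90-day half-life; both are illustrative, and whether a query is time-sensitive is assumed to be decided upstream.

```python
def recency_weight(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / half_life_days)


def effective_score(
    base_score: float,
    age_days: float,
    time_sensitive: bool,
    half_life_days: float = 90.0,
) -> float:
    """Apply the recency weight only when the query asks for fresh content."""
    if not time_sensitive:
        return base_score
    return base_score * recency_weight(age_days, half_life_days)


# "latest API updates": a 90-day-old document loses half its score
print(effective_score(0.80, age_days=90, time_sensitive=True))
# "historical pricing": age is ignored
print(effective_score(0.80, age_days=900, time_sensitive=False))
```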
Multi-Signal Score Fusion
Combining different signals:
Signal Types:
1. Semantic similarity: 0.75
2. Keyword match (BM25): 12.3
3. Document popularity: 0.90 (90th percentile)
4. Freshness: 0.60 (60 days old)
5. User preference: 0.80 (user often reads this type)
Question: How to combine?
→ Each has different scale
→ Different meaning
Fusion Strategies:
Simple weighted average (bad):
→ 0.7×semantic + 0.2×popularity + 0.1×freshness
→ But the scales are incompatible
→ Semantic in [0.4, 0.9]
→ Popularity in [0, 1]
Better: Normalize first, using each signal's own mean and std dev
→ z_semantic = (semantic - μ_semantic) / σ_semantic
→ z_popularity = (popularity - μ_popularity) / σ_popularity
→ z_freshness = (freshness - μ_freshness) / σ_freshness
→ final = 0.7×z_semantic + 0.2×z_popularity + 0.1×z_freshness
All on same scale
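The normalize-then-weight strategy generalizes to any number of signals. The sketch below assumes each signal is a list of values aligned by candidate; the signal values are invented and the weights are the ones used above.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Z-normalize one signal across the candidates of a single query."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def fuse(signals: Dict[str, Sequence[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of z-normalized signals, one fused score per candidate."""
    lengths = {len(values) for values in signals.values()}
    if len(lengths) != 1:
        raise ValueError("every signal must cover the same candidates")
    fused = [0.0] * lengths.pop()
    for name, values in signals.items():
        for i, z in enumerate(standardize(values)):
            fused[i] += weights[name] * z
    return fused


# Hypothetical signal values for three candidates
signals = {
    "semantic": [0.75, 0.70, 0.60],
    "popularity": [0.90, 0.20, 0.50],
    "freshness": [0.60, 0.90, 0.10],
}
weights = {"semantic": 0.7, "popularity": 0.2, "freshness": 0.1}

fused = fuse(signals, weights)
ranking = sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
print(ranking, [round(f, 3) for f in fused])
```

Z-scores are sensitive to outliers and small candidate sets; rank-based fusion is an alternative when the signal distributions are heavily skewed.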
How to Solve
Combine the following:
- Apply per-query z-score normalization
- Set percentile-based thresholds (e.g., top 20%) instead of absolute scores
- Track score distributions per query type
- Implement learning-to-rank with click data
- Normalize document-specific baseline scores
See Score Calibration.
Last updated January 26, 2026


