Rag Scenarios And Solutions
Embedding Quality Metrics
Cannot measure embedding quality, making it impossible to know if embeddings accurately capture document semantics or compare embedding models.
TL;DR
Cannot measure embedding quality, making it impossible to know if embeddings accurately capture document semantics or compare embedding models.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
Cannot measure embedding quality, making it impossible to know if embeddings accurately capture document semantics or compare embedding models.
Symptoms
- ❌ Don't know if embeddings are good
- ❌ Cannot compare embedding models
- ❌ No metrics for domain adaptation quality
- ❌ Embedding drift undetected
- ❌ Cannot validate fine-tuning improvements
Real-World Example
Company considers switching:
→ Current: OpenAI text-embedding-ada-002
→ Candidate: Cohere embed-english-v3.0
Questions:
→ Which performs better for our domain?
→ How much better (quantify)?
→ Worth the migration cost?
No metrics:
→ Cannot answer objectively
→ Guessing based on vibes
Deep Technical Analysis
Intrinsic Metrics
Embedding Dimensionality:
Model A: 768 dimensions
Model B: 1536 dimensions
Higher dimensions:
→ More expressive (can capture nuance)
→ But: Slower search, more storage
→ Not always better
Need: Quality per dimension
Intra-Cluster vs Inter-Cluster Distance:
Good embeddings:
→ Similar docs close together (high intra-cluster similarity)
→ Dissimilar docs far apart (low inter-cluster similarity)
Measure:
→ Silhouette score
→ Davies-Bouldin index
Higher = better embedding quality
Extrinsic Metrics (Task-Based)
Retrieval Accuracy:
Gold standard test set:
→ Query: "API authentication"
→ Relevant docs: [doc_5, doc_12, doc_23]
Measure:
→ Precision@K: Are top-K results relevant?
→ Recall@K: Did we find all relevant docs?
→ NDCG: Ranking quality
Compare models on same test set
Example Evaluation:
Model A (OpenAI):
→ Precision@5: 0.85
→ Recall@10: 0.78
→ NDCG@10: 0.82
Model B (Cohere):
→ Precision@5: 0.88 (+3pp)
→ Recall@10: 0.82 (+4pp)
→ NDCG@10: 0.86 (+4pp)
Model B objectively better
Domain Adaptation Testing
Out-of-Domain Detection:
Test on domain-specific terms:
→ Your jargon: "Redwood Strategy", "Cedar mode"
Generic model:
→ May not understand these terms
→ Low similarity to related concepts
Domain-adapted model:
→ Should recognize: "Redwood" ~ "RAG approach"
→ Higher semantic accuracy
Semantic Similarity Tests:
Pairs that SHOULD be similar:
→ ("login", "authentication") = High similarity expected
→ ("API key", "access token") = High similarity expected
Pairs that SHOULD be dissimilar:
→ ("login", "pricing") = Low similarity expected
Measure:
→ % correct classifications
→ Higher = better embedding quality
Embedding Drift Detection
Temporal Consistency:
Track embedding quality over time:
→ Month 1: NDCG = 0.85
→ Month 6: NDCG = 0.78 (dropped)
Possible causes:
→ Knowledge base changed (new jargon)
→ User queries evolved
→ Embedding model updated
Alerts: Quality degradation
Model Version Comparison:
OpenAI updates model:
→ text-embedding-ada-002 v1
→ text-embedding-ada-002 v2 (silently updated)
Re-evaluate:
→ Did metrics change?
→ Need re-embedding?
Track model versions explicitly
Fine-Tuning Validation
Before vs After:
Base model:
→ NDCG: 0.75
After fine-tuning on your docs:
→ NDCG: 0.82 (+7pp)
Validates fine-tuning effort:
→ Quantifies improvement
→ Justifies cost
How to Solve
Create golden evaluation dataset (query + relevant docs) + measure Precision@K, Recall@K, NDCG@K on eval set + compare models using same metrics + track metrics over time (detect drift) + test semantic similarity on domain-specific term pairs + evaluate fine-tuned models against baseline + monitor embedding model version changes + automate eval runs on each knowledge base update. See Embedding Metrics.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/monitoring/embedding-metrics.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Related Pages
Comparisons
Last updated January 26, 2026


