Embedding Quality Metrics

The Problem

Cannot measure embedding quality, making it impossible to know if embeddings accurately capture document semantics or compare embedding models.

Symptoms

❌ Don't know if embeddings are good
❌ Cannot compare embedding models
❌ No metrics for domain adaptation quality
❌ Embedding drift undetected
❌ Cannot validate fine-tuning improvements

Real-World Example

Company considers switching:
→ Current: OpenAI text-embedding-ada-002
→ Candidate: Cohere embed-english-v3.0

Questions:
→ Which performs better for our domain?
→ How much better (quantify)?
→ Worth the migration cost?

No metrics:
→ Cannot answer objectively
→ Guessing based on vibes

Deep Technical Analysis

Intrinsic Metrics

Embedding Dimensionality:

Model A: 768 dimensions
Model B: 1536 dimensions

Higher dimensions:
→ More expressive (can capture nuance)
→ But: Slower search, more storage
→ Not always better

Need: Quality per dimension

Intra-Cluster vs Inter-Cluster Distance:

Good embeddings:
→ Similar docs close together (high intra-cluster similarity)
→ Dissimilar docs far apart (low inter-cluster similarity)

Measure:
→ Silhouette score
→ Davies-Bouldin index

Higher = better embedding quality

Extrinsic Metrics (Task-Based)

Retrieval Accuracy:

Gold standard test set:
→ Query: "API authentication"
→ Relevant docs: [doc_5, doc_12, doc_23]

Measure:
→ Precision@K: Are top-K results relevant?
→ Recall@K: Did we find all relevant docs?
→ NDCG: Ranking quality

Compare models on same test set

Example Evaluation:

Model A (OpenAI):
→ Precision@5: 0.85
→ Recall@10: 0.78
→ NDCG@10: 0.82

Model B (Cohere):
→ Precision@5: 0.88 (+3pp)
→ Recall@10: 0.82 (+4pp)
→ NDCG@10: 0.86 (+4pp)

Model B objectively better

Domain Adaptation Testing

Out-of-Domain Detection:

Test on domain-specific terms:
→ Your jargon: "Redwood Strategy", "Cedar mode"

Generic model:
→ May not understand these terms
→ Low similarity to related concepts

Domain-adapted model:
→ Should recognize: "Redwood" ~ "RAG approach"
→ Higher semantic accuracy

Semantic Similarity Tests:

Pairs that SHOULD be similar:
→ ("login", "authentication") = High similarity expected
→ ("API key", "access token") = High similarity expected

Pairs that SHOULD be dissimilar:
→ ("login", "pricing") = Low similarity expected

Measure:
→ % correct classifications
→ Higher = better embedding quality

Embedding Drift Detection

Temporal Consistency:

Track embedding quality over time:
→ Month 1: NDCG = 0.85
→ Month 6: NDCG = 0.78 (dropped)

Possible causes:
→ Knowledge base changed (new jargon)
→ User queries evolved
→ Embedding model updated

Alerts: Quality degradation

Model Version Comparison:

OpenAI updates model:
→ text-embedding-ada-002 v1
→ text-embedding-ada-002 v2 (silently updated)

Re-evaluate:
→ Did metrics change?
→ Need re-embedding?

Track model versions explicitly

Fine-Tuning Validation

Before vs After:

Base model:
→ NDCG: 0.75

After fine-tuning on your docs:
→ NDCG: 0.82 (+7pp)

Validates fine-tuning effort:
→ Quantifies improvement
→ Justifies cost

How to Solve

Create golden evaluation dataset (query + relevant docs) + measure Precision@K, Recall@K, NDCG@K on eval set + compare models using same metrics + track metrics over time (detect drift) + test semantic similarity on domain-specific term pairs + evaluate fine-tuned models against baseline + monitor embedding model version changes + automate eval runs on each knowledge base update. See Embedding Metrics.

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/rag-scenarios-and-solutions/monitoring/embedding-metrics.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.

Embedding Quality Metrics

Key Takeaways

The Problem

Symptoms

Real-World Example

Deep Technical Analysis

Intrinsic Metrics

Extrinsic Metrics (Task-Based)

Domain Adaptation Testing

Embedding Drift Detection

Fine-Tuning Validation

How to Solve

Agent Instructions: Querying This Documentation

Related Pages

Comparisons

Compliance

Investors

Industry