The Essential RAG Book

Evaluation Metrics

Evaluation of Retrieval-Augmented Generation (RAG) systems requires measuring both retrieval quality and generation quality. Unlike pure retrievers or language models, RAG introduces interactions between components that affect factuality, grounding, and completeness. Comprehensive evaluation thus involves intrinsic, extrinsic, and human-centered metrics.

┌───────────────────┐
│ Retriever Metrics │
└───────────────────┘
          ↓
┌───────────────────┐
│ Generator Metrics │
└───────────────────┘
          ↓
   ┌────────────┐
   │ Human Eval │
   └────────────┘
Retriever: Recall@k, Precision, MRR
Generator: Faithfulness, Factuality, BLEU, ROUGE
Human: Correctness, Helpfulness, Readability
          ↓
Combined Score = α·Retrieval + β·Generation + γ·Human

Figure 14: Multi-level RAG evaluation pipeline with retriever, generator, and human layers
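
The combined score in the figure can be sketched as a weighted sum. The figure does not give values for α, β, and γ, so the defaults below are placeholders. Two further assumptions go beyond the figure: each component score is first scaled to [0, 1], and the weights sum to 1 so the result stays in [0, 1]. The helper `likert_to_unit` is hypothetical.

```python
def likert_to_unit(rating: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a Likert rating (assumed 1-5 scale) onto [0, 1]."""
    return (rating - low) / (high - low)


def combined_score(retrieval: float, generation: float, human: float,
                   alpha: float = 0.3, beta: float = 0.4, gamma: float = 0.3) -> float:
    """Combined Score = alpha*Retrieval + beta*Generation + gamma*Human.

    The default weights are placeholders, not values from the figure.
    """
    weights = (alpha, beta, gamma)
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    for score in (retrieval, generation, human):
        if not 0.0 <= score <= 1.0:
            raise ValueError("component scores must lie in [0, 1]")
    return alpha * retrieval + beta * generation + gamma * human


# Hypothetical system: Recall@k of 0.8, faithfulness of 0.7, mean rating 4.2 of 5.
score = combined_score(0.8, 0.7, likert_to_unit(4.2))
print(round(score, 2))  # 0.76
```

In practice the weights encode a product decision: a customer-facing assistant might raise γ, while an early retriever experiment might put most of the weight on α.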

1. Retrieval metrics. Evaluate the retriever's ability to surface relevant context for each query. Typical metrics include Recall@k (coverage of gold evidence), Precision@k, Mean Reciprocal Rank (MRR), and NDCG. High Recall@k means the gold evidence actually reaches the generator, which is a precondition for grounding, while MRR captures how highly the first relevant result is ranked.
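
As a minimal sketch of these four metrics, the functions below score a single query, assuming binary relevance labels and unique document IDs in the ranked list. MRR is the mean of `reciprocal_rank` over all queries in the evaluation set. The document IDs are made up for illustration.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the gold documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by gold documents."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: two gold documents, five retrieved IDs.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4"}

print(recall_at_k(ranked, relevant, k=3))             # 0.5
print(round(precision_at_k(ranked, relevant, k=3), 3))  # 0.333
print(reciprocal_rank(ranked, relevant))              # 0.5
print(round(ndcg_at_k(ranked, relevant, k=5), 3))     # 0.651
```

Note how the metrics disagree: half the gold evidence is missing from the top 3, yet the first hit still sits at rank 2, so recall and reciprocal rank tell different stories about the same ranking.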

2. Generation metrics. Assess output text quality. Intrinsic metrics like BLEU, ROUGE-L, and METEOR quantify lexical overlap with a reference answer. However, these can miss semantic alignment, so newer evaluation approaches use factual consistency scores (FactCC, QAGS, or GPT-based judge models). Faithfulness measures how well answers align with the retrieved evidence rather than introducing hallucinated content.
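
To make the limitation of lexical overlap concrete, here is a simplified ROUGE-L sketch. It assumes lowercased whitespace tokenization and a plain F1 over the longest common subsequence, which is a common simplification rather than the reference implementation. The example sentences are made up.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (simplified tokenizer)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the eiffel tower is in paris"
paraphrase = "paris is home to the eiffel tower"

print(rouge_l_f1(reference, reference))             # 1.0
print(round(rouge_l_f1(paraphrase, reference), 3))  # 0.462
```

The paraphrase states the same fact as the reference but scores under 0.5, which is the gap that factual consistency and judge-based metrics are meant to close.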

3. End-to-end metrics. Composite metrics evaluate the full pipeline. 'Groundedness' and 'Answer Support Rate' assess if generated answers can be justified from retrieved context. 'Answer Completeness' evaluates recall of multi-fact responses. Automatic evaluation frameworks like TruLens and RAGAS combine these signals.
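
The text does not give a formula for Answer Support Rate, so the sketch below assumes one plausible definition: the fraction of answer sentences that a judge deems supported by at least one retrieved passage. The judge here, `overlap_judge`, is a crude token-overlap placeholder and not how TruLens or RAGAS compute their scores. A real pipeline would swap in an NLI model or an LLM judge.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on '.', '!' and '?' (adequate for a sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_judge(claim: str, contexts: List[str], threshold: float = 0.7) -> bool:
    """Placeholder judge: supported if enough claim tokens occur in one passage."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    if not claim_tokens:
        return False
    for passage in contexts:
        passage_tokens = set(re.findall(r"\w+", passage.lower()))
        if len(claim_tokens & passage_tokens) / len(claim_tokens) >= threshold:
            return True
    return False


def answer_support_rate(
    answer: str,
    contexts: List[str],
    judge: Callable[[str, List[str]], bool] = overlap_judge,
) -> float:
    """Fraction of answer sentences the judge marks as supported."""
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    supported = sum(1 for sentence in sentences if judge(sentence, contexts))
    return supported / len(sentences)


contexts = ["The Eiffel Tower was completed in 1889 and stands in Paris."]
answer = "The Eiffel Tower stands in Paris. It was designed by Leonardo da Vinci."

print(answer_support_rate(answer, contexts))  # 0.5
```

Keeping the judge as a parameter means the aggregation logic stays fixed while the support check is upgraded from token overlap to a stronger model.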

4. Latency and cost metrics. Operational metrics like end-to-end latency, token usage, and retrieval time help balance quality with throughput. These are crucial for production-grade deployments where cost per query matters.
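
One way to collect these operational numbers is to wrap each query in a small timing harness. In the sketch below, `fake_retrieve` and `fake_generate` are stand-ins so the example runs without a real retriever or LLM, token counts are approximated by whitespace splitting, and the per-1,000-token prices are placeholders rather than any provider's actual rates.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QueryStats:
    retrieval_s: float       # wall-clock time spent in the retriever
    generation_s: float      # wall-clock time spent in the generator
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_s(self) -> float:
        return self.retrieval_s + self.generation_s

    def cost_usd(self, usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
        """Token cost under per-1,000-token prices supplied by the caller."""
        return (self.prompt_tokens * usd_per_1k_prompt
                + self.completion_tokens * usd_per_1k_completion) / 1000.0


def timed_query(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], Tuple[str, int, int]],
) -> Tuple[str, QueryStats]:
    """Run one query and record per-stage latency plus token usage."""
    t0 = time.perf_counter()
    passages = retrieve(query)
    t1 = time.perf_counter()
    answer, prompt_tokens, completion_tokens = generate(query, passages)
    t2 = time.perf_counter()
    return answer, QueryStats(t1 - t0, t2 - t1, prompt_tokens, completion_tokens)


# Stand-in components (hypothetical) so the sketch is self-contained.
def fake_retrieve(query: str) -> List[str]:
    return ["The Eiffel Tower is in Paris."]


def fake_generate(query: str, passages: List[str]) -> Tuple[str, int, int]:
    answer = "It is in Paris."
    prompt_tokens = len(query.split()) + sum(len(p.split()) for p in passages)
    return answer, prompt_tokens, len(answer.split())


answer, stats = timed_query("Where is the Eiffel Tower?", fake_retrieve, fake_generate)
print(f"{stats.total_s * 1000:.3f} ms total, "
      f"{stats.prompt_tokens} prompt + {stats.completion_tokens} completion tokens")
print(stats.cost_usd(usd_per_1k_prompt=0.5, usd_per_1k_completion=1.5))
```

Logging retrieval and generation time separately makes it clear which stage to optimize when end-to-end latency exceeds the budget.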

5. Human evaluation. Human raters assess correctness, helpfulness, and clarity using Likert scales or pairwise comparisons. Hybrid pipelines often calibrate automatic metrics against human judgment baselines to ensure reliability.
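
A common way to check that calibration is rank correlation between an automatic metric and human ratings on the same answers. The sketch below implements Spearman's correlation in plain Python, with tied values sharing an average rank. The ratings and scores are invented for illustration, and the choice of Spearman over other agreement measures is an assumption.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: mean 1-5 Likert rating and an automatic score per answer.
human_likert = [4.5, 2.0, 3.5, 5.0, 1.5, 3.0]
auto_score = [0.82, 0.40, 0.55, 0.91, 0.35, 0.60]

print(round(spearman(human_likert, auto_score), 3))  # 0.943
```

A high correlation on a held-out sample suggests the automatic metric can stand in for raters during iteration, while a low one signals that the metric needs rework before it is trusted.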

6. Evaluation frameworks. Tools like LlamaIndex's EvalSuite, LangChain's QA Eval, and OpenAI's Evals automate dataset-level testing. On the academic side, benchmarks such as KILT pair retrieval with downstream knowledge-intensive tasks, BEIR evaluates retrieval alone, and REALM is a retrieval-augmented model rather than a benchmark.

When to use. Evaluation metrics guide iteration. Use retrieval-focused metrics early in development, end-to-end metrics during model tuning, and human evaluations for final validation in customer-facing systems.
