
RAG Evals: How to Evaluate Retrieval-Augmented Generation Systems

Julia
December 12, 2025
13 min read

By the end of this, you'll know:

  • Why RAG Evaluation is Hard
  • How RAG Retrieval Actually Works
  • The Four Core RAG Metrics
  • Retrieval Evaluation
  • Generation Evaluation
  • RAG Eval Frameworks
  • Building a RAG Eval Suite in Practice

#RAG Evals: How to Evaluate Retrieval-Augmented Generation Systems

RAG systems fail in ways that are hard to detect. The retrieval step might return the wrong chunks. The model might ignore the retrieved content and hallucinate anyway. The answer might be technically grounded in the retrieved text but still misleading. Or the retrieved content might be accurate but not relevant to what the user actually asked.

None of these failures show up in a simple "did the answer match?" evaluation. RAG evals require measuring the retrieval and generation steps separately - and understanding which component is causing the failures you observe.

#Why RAG Evaluation is Hard

A traditional ML model is evaluated against labeled test data. You know the correct answer; you measure how often the model gets it right. Precision, recall, F1, RMSE - the metrics are well-defined.

RAG evaluation is harder for three reasons:

1. There's no single "correct" answer for most questions. A correct answer can be phrased in many valid ways. Exact match doesn't work; string similarity doesn't work well either.

2. The pipeline has two failure modes that need separate measurement. Retrieval failures (wrong chunks retrieved) are different from generation failures (model ignores or misuses retrieved content). Measuring only the final answer conflates both.

3. Ground truth is expensive to produce. Creating a labeled evaluation dataset for a RAG system requires human annotators who can assess: did the retrieved chunks contain the relevant information? Is the generated answer faithful to the retrieved content? Is the answer correct? This is time-consuming and requires domain expertise.

The result: many teams deploy RAG systems without adequate evaluation, discover quality problems in production, and can't diagnose which component is failing.

#How RAG Retrieval Actually Works

Before evaluating a RAG system, it helps to understand what the retrieval step is actually doing - because different retrieval architectures fail in different ways, and those failures require different evaluation approaches.

#From Documents to Vectors

When you index a document, it goes through three transformations:

  1. Chunking: The document is split into overlapping text segments - typically 500–1,000 tokens each, with overlap at boundaries to preserve context that would otherwise be split mid-concept.
  2. Embedding: Each chunk is converted into a vector representation using an embedding model. The vector captures semantic meaning: similar content ends up geometrically close in the vector space, regardless of exact wording. This is what enables semantic search - "contract termination" and "service cancellation" map to nearby vectors even though the words differ.
  3. Knowledge graph extraction: Alongside chunk vectors, an AI model automatically extracts entities - people, organizations, concepts, events, locations - and the relationships between them. This graph captures what is in your documents: not just the raw text, but the connections between the things the text describes.

Both structures - the chunk vector index and the knowledge graph - are stored and used during retrieval.
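
The chunking step above can be sketched in a few lines. This is a simplified illustration that counts words rather than tokens (production systems use a tokenizer); the function name and defaults are ours, not a specific library's API:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into overlapping chunks.

    Sizes are in words here for simplicity; real pipelines count
    tokens (e.g. 500-1,000 per chunk) and often split on sentence
    or section boundaries rather than raw word offsets.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks
```

The overlap means the last `overlap` words of each chunk reappear at the start of the next one, so a concept split at a boundary is still intact in at least one chunk.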

#The Four Retrieval Modes

Modern RAG systems don't all retrieve the same way. Four distinct strategies cover the main use cases:

Naive - Direct vector similarity. Your query is embedded and compared against all chunk vectors. The closest matches are returned. Fast and accurate for specific factual lookups. Breaks down when relevant information is distributed across documents or hidden in entity relationships.

Local - Chunk retrieval plus graph expansion. Starts from the most relevant text chunks, then expands outward through the knowledge graph - pulling in entities and relationships connected to those chunks. Better for questions about specific people, projects, or events where surrounding context matters.

Global - Graph-first. Finds the most connected entities relevant to your query in the knowledge graph, then retrieves text mentioning those entities. Strong for cross-document synthesis and summarization; slower for specific factual lookups.

Hybrid - Adaptive combination. Evaluates the incoming query and dynamically selects the most effective strategy, combining local chunk retrieval with global graph context. Highest accuracy across query types; the recommended default for production systems.

The retrieval mode is not just a performance setting - it changes what your system is actually retrieving. Naive RAG retrieves text chunks. Hybrid RAG retrieves text chunks plus graph-structured context: entity descriptions, relationship summaries, and cross-document community clusters. That distinction is fundamental for how you evaluate.

#Why the Retrieval Mode Changes What You Need to Measure

The standard four RAG metrics - faithfulness, answer relevance, context precision, context recall - were developed with naive chunk retrieval in mind. They measure whether the right text segments were returned, and whether the model used them correctly. Add graph-based retrieval and two problems emerge.

The metrics become incomplete. In local and hybrid modes, the model also receives graph context - entity descriptions and relationship summaries - that doesn't register as a "retrieved chunk." A system with a context precision of 0.5 (only half the retrieved chunks are relevant) might still produce accurate answers because graph context filled the gap. Your standard metrics won't see this.

A new failure mode appears. The knowledge graph is built automatically during indexing by an AI model extracting entities and relationships. If that extraction was noisy - wrong entity types, missed relationships, conflated entities - then local and global retrieval will silently surface bad context. Chunk-level metrics won't catch this at all.

The practical implication: establish which retrieval mode your system uses before designing your eval suite. Naive RAG and hybrid RAG need different test sets, different metrics, and different diagnostic approaches. The sections below cover the standard framework - keep in mind that graph-based modes require extending it.

#The Four Core RAG Metrics

The RAG evaluation community has converged on four core metrics that cover the most important failure modes:

#1. Answer Faithfulness

What it measures: Does the generated answer stay within what the retrieved context actually says? Or does the model add information not present in the retrieved chunks?

Why it matters: High faithfulness means the model is using the retrieved content correctly and not hallucinating. Low faithfulness means the model is generating answers from its training data despite having retrieved context - the RAG is not working as intended.

How it's measured: Compare each claim in the generated answer against the retrieved context. What fraction of claims can be directly traced to the context?
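
The scoring itself is simple once claims are extracted. A minimal sketch, assuming claims have already been split out of the answer; the claim-verification step is normally an LLM judge, stubbed here with a toy substring check for illustration only:

```python
def faithfulness_score(claims, context, is_supported=None):
    """Fraction of answer claims traceable to the retrieved context.

    `is_supported` is normally an LLM judge prompt; the default
    substring check is a toy stand-in, not a real verifier.
    """
    if is_supported is None:
        is_supported = lambda claim, ctx: claim.lower() in ctx.lower()
    if not claims:
        return 1.0  # an empty answer makes no unsupported claims
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)
```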

#2. Answer Relevance

What it measures: Does the generated answer actually address the user's question? A faithful, accurate answer that doesn't address the question is still a failure.

Why it matters: Distinguishes between "the answer is grounded in retrieved content" (faithfulness) and "the answer is useful for the user" (relevance).

How it's measured: Score the semantic similarity between the answer and the original question. Often done using an LLM as judge.
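
When done with embeddings rather than an LLM judge, the score reduces to cosine similarity between the question and answer vectors. A self-contained sketch; in practice the vectors come from an embedding model, here they are plain lists:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0
```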

#3. Context Precision

What it measures: Of the retrieved chunks, what fraction are actually relevant to answering the question?

Why it matters: High precision means the retrieval step is returning relevant content. Low precision means the model is receiving noisy context - some retrieved chunks are irrelevant - which degrades generation quality.

How it's measured: For each retrieved chunk, assess whether it contains information relevant to the question. Score = relevant chunks / total retrieved chunks.
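
As a formula this is a simple fraction. A sketch with the per-chunk relevance judgment left pluggable (in practice an LLM or human judge, here any callable):

```python
def context_precision(retrieved_chunks, is_relevant):
    """Fraction of retrieved chunks judged relevant to the question.

    `is_relevant` is typically an LLM-as-judge call or a human label;
    any chunk -> bool callable works for this sketch.
    """
    if not retrieved_chunks:
        return 0.0
    relevant = sum(1 for c in retrieved_chunks if is_relevant(c))
    return relevant / len(retrieved_chunks)
```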

#4. Context Recall

What it measures: Was all the information needed to answer the question present in the retrieved chunks?

Why it matters: High recall means the retrieval step found everything the model needed. Low recall means relevant information existed in the knowledge base but wasn't retrieved - the retrieval missed it.

How it's measured: Compare the retrieved chunks against the ground-truth answer. What fraction of the information needed to generate the correct answer was present in the retrieved context?
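
The mirror image of context precision: instead of judging retrieved chunks, you judge ground-truth claims. A sketch, again with the attribution judgment stubbed as a toy substring check in place of an LLM judge:

```python
def context_recall(ground_truth_claims, retrieved_context, is_attributable=None):
    """Fraction of ground-truth claims recoverable from the retrieved
    context. `is_attributable` is normally an LLM judge; the default
    substring check is an illustrative stand-in only."""
    if is_attributable is None:
        is_attributable = lambda claim, ctx: claim.lower() in ctx.lower()
    if not ground_truth_claims:
        return 1.0
    found = sum(1 for c in ground_truth_claims
                if is_attributable(c, retrieved_context))
    return found / len(ground_truth_claims)
```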

#Retrieval Evaluation

Retrieval evaluation focuses on the quality of the chunks returned by the vector search step, independent of generation.

Metrics:

  • Hit rate / Recall@k: For a given question with a known relevant document, does the relevant document appear in the top-k retrieved results?
  • Mean Reciprocal Rank (MRR): On average, how highly does the first relevant chunk rank? Computed as the mean of 1/rank of the first relevant result across test questions.
  • Normalized Discounted Cumulative Gain (NDCG): A graded relevance metric that gives more credit to highly relevant chunks ranked higher.
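
These three metrics are small enough to implement directly. A self-contained sketch per query (MRR and mean NDCG are then just averages over your test questions); document IDs and the graded-relevance dict are illustrative:

```python
import math

def hit_rate_at_k(retrieved_ids, relevant_ids, k):
    """1 if any relevant document appears in the top-k results."""
    return 1.0 if any(d in relevant_ids for d in retrieved_ids[:k]) else 0.0

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant result (0 if none retrieved).
    MRR is the mean of this value across queries."""
    for rank, d in enumerate(retrieved_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids, relevance, k):
    """NDCG with graded relevance labels (doc id -> grade)."""
    def dcg(ids):
        return sum(relevance.get(d, 0) / math.log2(i + 2)
                   for i, d in enumerate(ids[:k]))
    ideal = sorted(relevance, key=relevance.get, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(retrieved_ids) / ideal_dcg if ideal_dcg else 0.0
```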

What you need to run retrieval eval:

  • A set of test questions
  • For each question, the correct source document(s) that should be retrieved
  • Your retrieval system

This can be evaluated without any LLM - purely by checking whether the retrieval step returns the right documents.

Common retrieval failure modes:

  • Vocabulary mismatch: user asks about "contract termination"; the relevant document uses "service cancellation". Semantic search helps but doesn't fully solve this.
  • Chunking boundary: the relevant information spans two chunks, and neither chunk individually contains enough context to be ranked highly.
  • Embedding quality: the embedding model doesn't represent the domain vocabulary well.

For local, global, and hybrid modes, chunk-level metrics are necessary but not sufficient. You also need to evaluate the knowledge graph layer that these modes rely on:

  • Entity extraction precision: of the entities extracted during indexing, what fraction are correctly identified - right name, right type, right scope? Sample a subset of your documents, run the extractor, and compare against the source text.
  • Relationship accuracy: of the extracted relationships, what fraction correctly describe the actual connection between entities? A spurious or inverted relationship ("Company A acquired Company B" vs. the reverse) will silently corrupt any query that traverses that edge.
  • Graph coverage: are the most important concepts in your knowledge base well-represented in the graph, or are key entities missing? Low graph coverage means global and hybrid modes will fail to surface relevant context even when it exists in the indexed documents.

A noisy knowledge graph quietly degrades every graph-assisted retrieval mode. It won't show up in your chunk-level metrics - it will just look like the model is underperforming without an obvious cause.

#Generation Evaluation

Generation evaluation focuses on what the model does with the retrieved context.

Reference-free evaluation (no human labels needed): Use an LLM as a judge. Present the LLM evaluator with: the original question, the retrieved context, and the generated answer. Ask it to assess faithfulness ("does the answer stay within the context?"), relevance ("does the answer address the question?"), and overall quality.

This is the most scalable approach for continuous evaluation - you don't need human labels for every test case. But LLM-as-judge introduces its own biases (preferring verbose answers, favoring its own output patterns) and is not perfectly reliable.

Reference-based evaluation (human labels required): Compare generated answers against gold-standard human-written answers using metrics like ROUGE, BERTScore, or semantic similarity. More reliable for correctness evaluation, but requires the upfront investment of creating the reference answers.

Hallucination detection: Use an NLI (natural language inference) model or an LLM judge to classify each sentence in the generated answer as: supported by the retrieved context, contradicted by the retrieved context, or not addressable from the context.
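
The structure of that sentence-level check can be sketched as follows. The real `judge` would be an NLI model or LLM call returning one of the three labels; the default here is a toy judge that can only detect verbatim support, included so the sketch runs standalone:

```python
def classify_sentences(answer_sentences, context, judge=None):
    """Label each answer sentence as 'supported', 'contradicted', or
    'not_addressable' relative to the retrieved context.

    `judge` is normally an NLI model or LLM; the default toy judge
    only recognizes verbatim support and never flags contradictions.
    """
    if judge is None:
        judge = lambda s, ctx: ("supported" if s.lower() in ctx.lower()
                                else "not_addressable")
    return {s: judge(s, context) for s in answer_sentences}
```

Sentences labeled `contradicted` or `not_addressable` are your hallucination candidates for human review.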

#RAG Eval Frameworks

Several open-source frameworks implement RAG evaluation metrics:

RAGAS (Retrieval Augmented Generation Assessment): The most widely used RAG-specific evaluation framework. Implements the four core metrics out of the box using LLM-as-judge internally. The metric implementations are precise enough to be worth understanding:

  • Faithfulness: breaks the generated answer into individual claims, then checks each claim against the retrieved context. Score = verifiable claims / total claims.
  • Answer relevance: prompts an LLM to generate several questions based on the answer, then measures mean cosine similarity between those generated questions and the original question. Answers that drift off-topic score low even if they're factually accurate.
  • Context precision: evaluates not just whether relevant chunks were retrieved, but where they rank. Relevant chunks buried at position 8 of 10 hurt the score more than the same chunks ranked 1 and 2 - because the model's attention degrades with position.
  • Context recall: takes each claim in the ground-truth answer and checks whether it can be attributed to the retrieved context. Measures completeness rather than precision.

Open source, integrates with LangChain and LlamaIndex. Also includes synthetic test set generation - useful for bootstrapping your eval dataset without manual annotation.

DeepEval: A broader LLM evaluation framework that includes RAG-specific metrics alongside general LLM evaluation capabilities. Supports custom metrics and CI/CD integration.

TruLens: Focused on LLM app evaluation with particular strength in the "triad" of context relevance, groundedness, and answer relevance. Good observability tooling.

LlamaIndex Evaluation: Evaluation modules built into LlamaIndex that assess retrieval and generation quality within the LlamaIndex ecosystem.

#Building a RAG Eval Suite in Practice

A practical RAG eval setup:

Step 1: Build a test set. Create 50-200 question-answer pairs that cover the range of questions your users actually ask. Include the source document(s) for each question (for retrieval eval). Generate or write reference answers (for generation eval).

The test set is your most valuable eval asset. Invest in making it representative and high-quality.

Step 2: Evaluate retrieval first. Before evaluating generation, verify that your retrieval is working. Measure hit rate and context recall against your test set. Fix retrieval problems (chunking strategy, embedding model, hybrid search) before moving on. Generation quality cannot exceed retrieval quality.

Step 3: Evaluate generation. With retrieval working well, evaluate faithfulness and answer relevance using an LLM-as-judge setup. Flag answers with low faithfulness scores for human review - these are your hallucinations.

Step 4: Run evals continuously. RAG system quality degrades as the knowledge base grows and changes. Run your eval suite on every significant update to the knowledge base or retrieval configuration. Treat RAG eval like unit tests for traditional software.

Step 5: Identify the failure pattern. When overall performance drops, use component-level metrics to diagnose: is it a retrieval problem (context precision or recall dropped) or a generation problem (faithfulness dropped despite good retrieval)?
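
That diagnostic logic can be automated. A minimal sketch; the metric names and the 0.05 regression tolerance are illustrative choices, not a standard:

```python
def diagnose_regression(baseline, current, tolerance=0.05):
    """Compare component-level metrics between a baseline eval run and
    the current run; return (component, metric) pairs that regressed.

    Metric names and the tolerance are illustrative - adapt them to
    whatever your eval suite actually reports.
    """
    issues = []
    for metric in ("context_precision", "context_recall"):
        if baseline[metric] - current[metric] > tolerance:
            issues.append(("retrieval", metric))
    for metric in ("faithfulness", "answer_relevance"):
        if baseline[metric] - current[metric] > tolerance:
            issues.append(("generation", metric))
    return issues
```

A drop in the retrieval metrics points at chunking, embeddings, or index drift; a faithfulness drop with stable retrieval metrics points at the generation side.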

Evaluation discipline is what separates RAG systems that work reliably in production from those that looked good in demos.
