Evaluation Metrics
To know if your RAG system is improving, you need metrics. RAG evaluation measures two components: retrieval (how well relevant documents are found) and generation (how faithful and relevant the answer is).
Retrieval metrics: Hit Rate, MRR, NDCG. Generation metrics: Faithfulness, Answer Relevancy, Context Relevancy.
Retrieval Metrics
- Hit Rate: Percentage of queries where at least one relevant document is in the top‑k retrieved. Simple and intuitive.
- Mean Reciprocal Rank (MRR): Average of 1/rank of the first relevant document. Measures how early the first relevant doc appears.
- Normalized Discounted Cumulative Gain (NDCG): Considers ranking of multiple relevant documents, with position discounts.
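The three retrieval metrics above can be sketched directly. This is a minimal illustration with binary relevance (a document is either relevant or not); the function names and toy document IDs are my own, not from any library:

```python
import math

def hit_rate(results, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top-k."""
    hits = sum(
        1 for retrieved, rel in zip(results, relevant)
        if any(doc in rel for doc in retrieved[:k])
    )
    return hits / len(results)

def mrr(results, relevant):
    """Mean of 1/rank of the first relevant document (0 if none found)."""
    total = 0.0
    for retrieved, rel in zip(results, relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)

def ndcg(retrieved, rel, k=5):
    """NDCG@k with binary relevance: each hit is discounted by
    log2(position + 1), then normalized by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in rel
    )
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(rel), k) + 1))
    return dcg / ideal if ideal else 0.0

# Toy example: two queries, top-3 retrieved doc IDs each.
results = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
print(hit_rate(results, relevant, k=3))  # 0.5 -- only query 1 has a hit
print(mrr(results, relevant))            # 0.25 -- first hit at rank 2
```

Note how MRR penalizes the same hit more than Hit Rate does when it appears lower in the ranking, while NDCG extends this idea to multiple relevant documents.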
Generation Metrics (Using LLM as Judge)
- Faithfulness: Does the answer stay consistent with the retrieved context? No hallucinations.
- Answer Relevancy: Is the answer relevant to the original question?
- Context Relevancy: Is the retrieved context relevant to the question?
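To make "LLM as judge" concrete, here is a minimal sketch of how a faithfulness check might be prompted. The function name, the 1-to-5 scale, and the wording are illustrative assumptions, not a standard API; frameworks like Ragas handle this prompting internally:

```python
def build_faithfulness_prompt(context: str, answer: str) -> str:
    """Ask a judge LLM whether every claim in the answer is supported
    by the retrieved context (hypothetical prompt, for illustration)."""
    return (
        "You are a strict evaluator. Given the context and the answer, "
        "rate from 1 (unfaithful) to 5 (fully faithful) how well every "
        "claim in the answer is supported by the context. "
        "Reply with the number only.\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Score:"
    )

prompt = build_faithfulness_prompt(
    context="The Eiffel Tower is 330 m tall.",
    answer="The Eiffel Tower is 330 m tall and located in Paris.",
)
# Send `prompt` to your judge LLM of choice and parse the numeric score.
```

Answer Relevancy and Context Relevancy follow the same pattern, swapping in the question/answer or question/context pair being judged.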
Implementation Example (Ragas)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

# `dataset` holds your evaluation examples: questions, generated
# answers, and the retrieved contexts for each question.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy],
)
print(result)

Why Metrics Matter
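The evaluation dataset is typically built from records like the following. The column names follow Ragas conventions and the sample row is illustrative; Ragas expects the dict to be wrapped in a Hugging Face `Dataset` before calling `evaluate`:

```python
# One illustrative evaluation record: the user question, the RAG
# system's generated answer, and the retrieved context chunks.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and largest city of France."]],
}
# Wrap for Ragas with: Dataset.from_dict(data)
# (Dataset comes from the Hugging Face `datasets` package.)
print(len(data["question"]))  # 1
```

In practice you would collect dozens or hundreds of such rows, so that metric averages are meaningful rather than dominated by a single query.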
Without metrics, you are guessing. Metrics tell you where to optimise: retrieval (hybrid search, reranking) or generation (prompt tuning, a better LLM).
Two Minute Drill
- Retrieval metrics: Hit Rate, MRR, NDCG.
- Generation metrics: Faithfulness, Answer Relevancy, Context Relevancy.
- Use frameworks like Ragas or TruLens.
- Evaluate both retrieval and generation separately.
Need more clarification?
Drop us an email at career@quipoinfotech.com
