Evaluation Metrics
To know if your RAG system is improving, you need metrics. RAG evaluation measures two components: retrieval (how well relevant documents are found) and generation (how faithful and relevant the answer is).
Retrieval metrics: Hit Rate, MRR, NDCG. Generation metrics: Faithfulness, Answer Relevancy, Context Relevancy.
Retrieval Metrics
- Hit Rate: Percentage of queries where at least one relevant document is in the top‑k retrieved. Simple and intuitive.
- Mean Reciprocal Rank (MRR): Average of 1/rank of the first relevant document. Measures how early the first relevant doc appears.
- Normalized Discounted Cumulative Gain (NDCG): Considers ranking of multiple relevant documents, with position discounts.
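The three retrieval metrics above can be sketched directly. This is a minimal illustration with binary relevance (a document is either relevant or not); the function names and toy document IDs are my own, not from any library:

```python
import math

def hit_rate(results, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top-k."""
    hits = sum(
        1 for retrieved, rel in zip(results, relevant)
        if any(doc in rel for doc in retrieved[:k])
    )
    return hits / len(results)

def mrr(results, relevant):
    """Mean of 1/rank of the first relevant document (0 if none found)."""
    total = 0.0
    for retrieved, rel in zip(results, relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)

def ndcg(retrieved, rel, k=5):
    """NDCG@k with binary relevance: each hit is discounted by
    log2(position + 1), then normalized by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in rel
    )
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(rel), k) + 1))
    return dcg / ideal if ideal else 0.0

# Toy example: two queries, top-3 retrieved doc IDs each.
results = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
print(hit_rate(results, relevant, k=3))  # 0.5 -- only query 1 has a hit
print(mrr(results, relevant))            # 0.25 -- first hit at rank 2
```

Note how MRR penalizes the same hit more than Hit Rate does when it appears lower in the ranking, while NDCG extends this idea to multiple relevant documents.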
Generation Metrics (Using LLM as Judge)
- Faithfulness: Does the answer stay consistent with the retrieved context? No hallucinations.
- Answer Relevancy: Is the answer relevant to the original question?
- Context Relevancy: Is the retrieved context relevant to the question?
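To make "LLM as judge" concrete, here is a minimal sketch of how a faithfulness check might be prompted. The function name, the 1-to-5 scale, and the wording are illustrative assumptions, not a standard API; frameworks like Ragas handle this prompting internally:

```python
def build_faithfulness_prompt(context: str, answer: str) -> str:
    """Ask a judge LLM whether every claim in the answer is supported
    by the retrieved context (hypothetical prompt, for illustration)."""
    return (
        "You are a strict evaluator. Given the context and the answer, "
        "rate from 1 (unfaithful) to 5 (fully faithful) how well every "
        "claim in the answer is supported by the context. "
        "Reply with the number only.\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Score:"
    )

prompt = build_faithfulness_prompt(
    context="The Eiffel Tower is 330 m tall.",
    answer="The Eiffel Tower is 330 m tall and located in Paris.",
)
# Send `prompt` to your judge LLM of choice and parse the numeric score.
```

Answer Relevancy and Context Relevancy follow the same pattern, swapping in the question/answer or question/context pair being judged.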
Implementation Example (Ragas)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

# `dataset` holds your evaluation examples: questions, generated
# answers, and the retrieved contexts for each question.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy],
)
print(result)

Why Metrics Matter
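The evaluation dataset is typically built from records like the following. The column names follow Ragas conventions and the sample row is illustrative; Ragas expects the dict to be wrapped in a Hugging Face `Dataset` before calling `evaluate`:

```python
# One illustrative evaluation record: the user question, the RAG
# system's generated answer, and the retrieved context chunks.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and largest city of France."]],
}
# Wrap for Ragas with: Dataset.from_dict(data)
# (Dataset comes from the Hugging Face `datasets` package.)
print(len(data["question"]))  # 1
```

In practice you would collect dozens or hundreds of such rows, so that metric averages are meaningful rather than dominated by a single query.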
Without metrics, you are guessing. Metrics tell you where to optimise: retrieval (hybrid search, reranking) or generation (prompt tuning, a better LLM).
Two Minute Drill
- Retrieval metrics: Hit Rate, MRR, NDCG.
- Generation metrics: Faithfulness, Answer Relevancy, Context Relevancy.
- Use frameworks like Ragas or TruLens.
- Evaluate both retrieval and generation separately.
Need more clarification?
Drop us an email at career@quipoinfotech.com
