
Evaluation Metrics

To know if your RAG system is improving, you need metrics. RAG evaluation measures two components: retrieval (how well relevant documents are found) and generation (how faithful and relevant the answer is).

Retrieval metrics: Hit Rate, MRR, NDCG. Generation metrics: Faithfulness, Answer Relevancy, Context Relevancy.

Retrieval Metrics

  • Hit Rate: Percentage of queries where at least one relevant document is in the top‑k retrieved. Simple and intuitive.
  • Mean Reciprocal Rank (MRR): Average of 1/rank of the first relevant document. Measures how early the first relevant doc appears.
  • Normalized Discounted Cumulative Gain (NDCG): Considers ranking of multiple relevant documents, with position discounts.
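The three retrieval metrics above can be computed directly from ranked results. A minimal sketch, assuming each query's relevant documents are given as 0-based positions in the retrieved ranking (the helper names here are illustrative, not from any library):

```python
import math

def hit_rate(results, k):
    """Fraction of queries with at least one relevant doc in the top-k."""
    return sum(1 for ranks in results if any(r < k for r in ranks)) / len(results)

def mrr(results):
    """Mean of 1/rank of the first relevant doc (0 if none was retrieved)."""
    total = 0.0
    for ranks in results:
        if ranks:
            total += 1.0 / (min(ranks) + 1)  # positions are 0-based
    return total / len(results)

def ndcg_at_k(relevances, k):
    """NDCG@k for one query; `relevances` is graded relevance per position."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Positions (0-based) of relevant docs per query; query 3 retrieved none.
results = [[0], [2], []]
print(hit_rate(results, k=3))  # 2 of 3 queries had a hit in the top 3
print(mrr(results))            # (1/1 + 1/3 + 0) / 3
```

Note how the two metrics disagree: both hit queries count equally for Hit Rate, while MRR rewards the query whose relevant document appeared first.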

Generation Metrics (Using LLM as Judge)

  • Faithfulness: Does the answer stay consistent with the retrieved context? No hallucinations.
  • Answer Relevancy: Is the answer relevant to the original question?
  • Context Relevancy: Is the retrieved context relevant to the question?
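The LLM-as-judge pattern behind these metrics is simple: prompt a model with the context and answer, and parse its verdict. A minimal faithfulness sketch, where `ask_llm` and `stub_judge` are hypothetical stand-ins for a real chat-completion call:

```python
# `ask_llm` is any callable that sends a prompt to an LLM and returns its
# text reply; the prompt template below is illustrative, not from a library.
FAITHFULNESS_PROMPT = """Given the context and the answer, reply "yes" if every
claim in the answer is supported by the context, otherwise reply "no".

Context: {context}
Answer: {answer}
Reply:"""

def faithfulness_score(samples, ask_llm):
    """Fraction of answers the judge deems fully supported by their context."""
    verdicts = [
        ask_llm(FAITHFULNESS_PROMPT.format(context=s["context"], answer=s["answer"]))
        for s in samples
    ]
    return sum(v.strip().lower().startswith("yes") for v in verdicts) / len(samples)

# Stub judge for illustration only; real use plugs in an actual LLM client.
def stub_judge(prompt):
    return "yes" if "Paris" in prompt else "no"

samples = [
    {"context": "Paris is the capital of France.", "answer": "The capital is Paris."},
    {"context": "The sky is blue.", "answer": "The capital is Berlin."},
]
print(faithfulness_score(samples, stub_judge))  # 0.5
```

Frameworks like Ragas wrap exactly this pattern (with more careful claim-by-claim prompting) so you don't write the judge prompts yourself.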

Implementation Example (Ragas)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

# `dataset` holds your evaluation samples: question, answer, and the
# retrieved contexts for each query (e.g. a Hugging Face Dataset).
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy],
)
print(result)

Why Metrics Matter

Without metrics, you are guessing. Metrics tell you where to optimise: improve retrieval (hybrid search, reranking) or improve generation (prompt tuning, a better LLM).


Two Minute Drill
  • Retrieval metrics: Hit Rate, MRR, NDCG.
  • Generation metrics: Faithfulness, Answer Relevancy, Context Relevancy.
  • Use frameworks like Ragas or TruLens.
  • Evaluate both retrieval and generation separately.

Need more clarification?

Drop us an email at career@quipoinfotech.com