Building Ground Truth
For reliable evaluation, you need a golden dataset (ground truth): a set of questions, expected answers, and relevant document IDs. This allows you to compute metrics like accuracy, recall, and precision.
Ground truth = human‑annotated pairs of questions and correct answers (and relevant context).
How to Build a Golden Dataset
- Select a sample of documents from your knowledge base.
- Write questions that can be answered from these documents.
- Annotate the correct answer (and which document/chunk contains it).
- Aim for at least 50 examples, and ideally 200 or more, for meaningful evaluation.
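The annotation steps above can be captured with a small helper that builds entries in the dataset format used in this section and warns when the dataset is too small for reliable metrics (a sketch; the field names follow the JSON format shown below, the size threshold is the guideline above):

```python
import json

def make_example(question, answer, context_ids):
    """One golden-dataset entry: question, expected answer, relevant chunk IDs."""
    return {"question": question, "answer": answer, "context_ids": list(context_ids)}

def save_golden_dataset(examples, path, min_size=50):
    """Persist the dataset as JSON, warning if it is too small to be meaningful."""
    if len(examples) < min_size:
        print(f"Warning: only {len(examples)} examples; aim for {min_size}+")
    with open(path, "w") as f:
        json.dump(examples, f, indent=2)

examples = [
    make_example("What is RAG?", "Retrieval-Augmented Generation...", ["doc1_chunk3"]),
]
```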
Synthetic Data Generation
Use an LLM to generate question‑answer pairs from your documents, then human‑validate a subset.
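Because the model's reply follows a fixed "Question: ... Answer: ..." format, the pairs can be extracted with a small parser, and a random subset set aside for human review. A sketch (the LLM call itself is omitted; `fraction` is an assumed review rate):

```python
import random
import re

def parse_qa(reply):
    """Extract (question, answer) from a 'Question: ... Answer: ...' reply."""
    match = re.search(r"Question:\s*(.+?)\s*Answer:\s*(.+)", reply, re.DOTALL)
    if not match:
        return None  # malformed reply; skip or retry generation
    return match.group(1).strip(), match.group(2).strip()

def sample_for_review(pairs, fraction=0.2, seed=0):
    """Pick a random subset of generated pairs for human validation."""
    rng = random.Random(seed)
    k = max(1, int(len(pairs) * fraction))
    return rng.sample(pairs, k)

qa = parse_qa("Question: What is RAG? Answer: Retrieval-Augmented Generation.")
# qa == ("What is RAG?", "Retrieval-Augmented Generation.")
```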
prompt = f"""Generate a question and answer based on this document chunk:
{chunk}
Format: Question: ... Answer: ..."""
Continuous Evaluation
As your system changes (new chunking, new LLM, new retriever), re‑run evaluation on the golden dataset to detect regressions. Automate this in CI/CD.
Format for Golden Dataset
[
  {
    "question": "What is RAG?",
    "answer": "Retrieval-Augmented Generation...",
    "context_ids": ["doc1_chunk3", "doc2_chunk1"]
  }
]
Two Minute Drill
- Ground truth is essential for accurate evaluation.
- Create a golden dataset with questions, answers, and relevant context.
- Use synthetic generation + human validation.
- Automate continuous evaluation to catch regressions.
Need more clarification?
Drop us an email at career@quipoinfotech.com
