LLM-as-a-Judge
Instead of labeling outputs manually, you can use a strong LLM (e.g. GPT‑4 or Claude) to evaluate your RAG outputs. This technique is called LLM‑as‑a‑Judge, and it is far cheaper and more scalable than human annotation.
Use an LLM to score faithfulness, relevance, and correctness against ground truth or criteria.
How It Works
1. Create a prompt that asks the judge LLM to score an answer on a scale (e.g., 1‑5).
2. Provide the question, retrieved context, answer, and optionally ground truth.
3. Parse the numerical score from the LLM's response.
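The three steps above can be sketched end to end. This is a minimal illustration, not a fixed API: `call_judge` is a placeholder for whatever client you use (OpenAI SDK, Anthropic SDK, a LangChain model, etc.), and the regex parse in step 3 is an assumption of mine — it is more forgiving than a bare `int(response)`, since judge models sometimes wrap the score in extra words.

```python
import re

# Step 1: a scoring prompt with an explicit scale and output instruction.
JUDGE_TEMPLATE = """You are an evaluator. Rate the answer from 1 to 5 on \
faithfulness to the context. Output only the number.

Question: {question}
Context: {context}
Answer: {answer}
Score:"""


def parse_score(response: str, lo: int = 1, hi: int = 5) -> int:
    """Step 3: pull the first integer out of the judge's reply and range-check it."""
    match = re.search(r"\d+", response)
    if match is None:
        raise ValueError(f"No score found in judge output: {response!r}")
    score = int(match.group())
    if not lo <= score <= hi:
        raise ValueError(f"Score {score} outside [{lo}, {hi}]")
    return score


def judge(question, context, answer, call_judge):
    # Step 2: fill in the question, retrieved context, and answer.
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    return parse_score(call_judge(prompt))
```

Usage: `judge("What is X?", "ctx...", "ans...", call_judge=my_llm_call)`, where `my_llm_call` takes a prompt string and returns the model's text reply.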
Example Judge Prompt
You are an evaluator. Given the question, context, and answer, rate the answer from 1 to 5 on faithfulness (no hallucination). 5 = fully faithful, 1 = completely made up. Output only the number.
Implementation (Simple)
def judge_faithfulness(question, context, answer):
    # Triple-quoted prompt so each field sits on its own line.
    prompt = f"""Rate faithfulness (1-5):
Question: {question}
Context: {context}
Answer: {answer}
Score:"""
    response = llm.invoke(prompt)  # any text-in/text-out chat client works here
    return int(response.strip())
Best Practices
- Use a strong judge model (GPT‑4, Claude).
- Provide clear rubrics.
- Calibrate with human labels.
- Mitigate positional bias in pairwise comparisons by querying twice with the item order swapped.
Two Minute Drill
- LLM‑as‑a‑Judge automates RAG evaluation.
- Use a strong model and clear scoring rubric.
- Can evaluate faithfulness, relevance, and correctness.
- Cost‑effective for large test sets.
Need more clarification?
Drop us an email at career@quipoinfotech.com
