LLM Evaluation — RAGAS, DeepEval, G-Eval

> Exact-match and F1 miss semantic equivalence. Human review does not scale. LLM-as-judge is the production answer — with enough calibration to trust the number.

Type: Build

Languages: Python

Prerequisites: Phase 5 · 13 (Question Answering), Phase 5 · 14 (Information Retrieval)

Time: ~75 minutes

The Problem

Your RAG system answers: "June 29th, 2007."

The gold reference is: "June 29, 2007."

Exact Match scores 0. F1 scores ~75%. A human would score 100%.

Now multiply by 10,000 test cases. Multiply again by every change to the retriever, chunking, prompt, or model. You need an evaluator that understands meaning, runs cheaply at scale, does not lie about regressions, and surfaces the right failure modes.

2026 has three frameworks that own this problem.

All three lean on LLM-as-judge. This lesson builds intuition for the method and the trust layer around it.

The Concept

Four evaluation dimensions, LLM-as-judge architecture

LLM-as-judge. Replace a static metric with an LLM that scores outputs given a rubric. Given (query, context, answer), prompt a judge LLM: "Score 0-1 on faithfulness." Return the score.

Why it works: LLMs approximate human judgment at a tiny fraction of the cost. GPT-4o-mini at ~$0.003 per scored case enables 1000-sample regression eval runs for under $5.

Why it fails silently:

  1. Judge bias. Judges prefer longer answers, answers from their own model family, answers that match the prompt style.
  2. JSON parsing failures. Bad JSON → NaN score → silently excluded from the aggregate. RAGAS users know this pain. Gate with try/except + explicit failure mode.
  3. Drift over model versions. Upgrading the judge changes every metric. Freeze judge model + version.

The RAG four.

Metric Question Backend
Faithfulness Does each claim in the answer come from the retrieved context? NLI-based entailment
Answer relevance Does the answer address the question? Generate hypothetical questions from answer; compare to real question
Context precision Of retrieved chunks, what fraction were relevant? LLM-judge
Context recall Did retrieval return everything needed? LLM-judge against gold answer

G-Eval. Define a custom criterion: "Did the answer cite the correct source?" The framework auto-expands into chain-of-thought evaluation steps, then scores 0-1. Good for domain-specific quality dimensions RAGAS does not cover.

Calibration. Never trust the raw judge score until you have a correlation against human labels. Run 100 hand-labeled examples. Plot judge vs human. Compute Spearman rho. If rho < 0.7, your judge rubric needs work.

Build It

Step 1: faithfulness with NLI (RAGAS-style)

from typing import Callable
from transformers import pipeline

nli = pipeline("text-classification",
               model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
               top_k=None)

# `llm` is any callable: prompt str -> generated str.
# Example: llm = lambda p: client.messages.create(model="claude-haiku-4-5", ...).content[0].text
LLM = Callable[[str], str]


def atomic_claims(answer: str, llm: LLM) -> list[str]:
    prompt = f"""Break this answer into simple factual claims (one per line):
{answer}
"""
    return llm(prompt).splitlines()


def faithfulness(answer: str, context: str, llm: LLM) -> float:
    claims = atomic_claims(answer, llm)
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        result = nli({"text": context, "text_pair": claim})[0]
        entail = next((s for s in result if s["label"] == "entailment"), None)
        if entail and entail["score"] > 0.5:
            supported += 1
    return supported / len(claims)

Decompose the answer into atomic claims. NLI-check each claim against the retrieved context. Faithfulness = fraction supported.

Step 2: answer relevance

import numpy as np
from sentence_transformers import SentenceTransformer

# encoder: any model implementing .encode(texts, normalize_embeddings=True) -> ndarray
# e.g., encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def answer_relevance(question: str, answer: str, encoder, llm: LLM, n: int = 3) -> float:
    prompt = f"Write {n} questions this answer could be the answer to:\n{answer}"
    generated = [line for line in llm(prompt).splitlines() if line.strip()][:n]
    if not generated:
        return 0.0
    q_emb = np.asarray(encoder.encode([question], normalize_embeddings=True)[0])
    g_embs = np.asarray(encoder.encode(generated, normalize_embeddings=True))
    sims = [float(q_emb @ g_emb) for g_emb in g_embs]
    return sum(sims) / len(sims)

If the answer implies different questions than the one asked, relevance drops.

Step 3: G-Eval custom metric

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

metric = GEval(
    name="Correctness",
    criteria="The answer should be factually accurate and match the expected output.",
    evaluation_steps=[
        "Read the expected output.",
        "Read the actual output.",
        "List factual claims in the actual output.",
        "For each claim, mark supported or unsupported by the expected output.",
        "Return score = fraction supported.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

test = LLMTestCase(input="When was the first iPhone released?",
                   actual_output="June 29th, 2007.",
                   expected_output="June 29, 2007.")
metric.measure(test)
print(metric.score, metric.reason)

The evaluation steps are the rubric. Explicit steps are more stable than implicit "score 0-1" prompts.

Step 4: CI gate

import deepeval
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric


def test_rag_system():
    cases = load_regression_cases()
    faith = FaithfulnessMetric(threshold=0.85)
    rel = ContextualRelevancyMetric(threshold=0.7)
    for case in cases:
        faith.measure(case)
        assert faith.score >= 0.85, f"faithfulness regression on {case.id}"
        rel.measure(case)
        assert rel.score >= 0.7, f"relevancy regression on {case.id}"

Ship as a pytest file. Run on every PR. Block merges on regressions.

Step 5: toy eval from scratch

See code/main.py. Stdlib-only approximations of faithfulness (overlap of answer claims with context) and relevance (overlap of answer tokens with question tokens). Not production. Shows the shape.

Pitfalls

Use It

The 2026 stack:

Use case Framework
RAG quality monitoring RAGAS (4 metrics)
CI/CD regression gates DeepEval + pytest
Custom domain criteria G-Eval within DeepEval
Online live-traffic monitoring RAGAS with reference-free mode
Human-in-the-loop spot checks LangSmith or Phoenix with annotation UI
Red-teaming / safety eval Promptfoo + DeepEval

Typical stack: RAGAS for monitoring, DeepEval for CI, G-Eval for novel dimensions. Run all three; they disagree usefully.

Ship It

Save as outputs/skill-eval-architect.md:

name: eval-architect
description: Design an LLM evaluation plan with calibrated judge and CI gates.
version: 1.0.0
phase: 5
lesson: 27
tags: [nlp, evaluation, rag]
---

Given a use case (RAG / agent / generative task), output:

1. Metrics. Faithfulness / relevance / context-precision / context-recall + any custom G-Eval metrics with criteria.
2. Judge model. Named model + version, rationale for cost vs accuracy.
3. Calibration. Hand-labeled set size, target Spearman rho vs human > 0.7.
4. Dataset versioning. Tag strategy, change log, stratification.
5. CI gate. Thresholds per metric, regression-window logic, bottom-quantile alert.

Refuse to rely on a judge untested against ≥50 human-labeled examples. Refuse self-evaluation (same model generates + judges). Refuse aggregate-only reporting without bottom-10% surfacing. Flag any pipeline where judge upgrade lands without parallel baseline eval.

Exercises

  1. Easy. Use RAGAS on 10 RAG examples with known hallucinations. Verify the faithfulness metric catches each one.
  2. Medium. Hand-label 50 QA answers 0-1 for correctness. Score with G-Eval. Measure Spearman rho between judge and human.
  3. Hard. Build a pytest CI gate with DeepEval. Intentionally regress the retriever. Verify the gate fails. Add bottom-quantile alerting via threshold check on the lowest 10%.

Key Terms

Term What people say What it actually means
LLM-as-judge Scoring with an LLM Prompt a judge model to score outputs 0-1 given a rubric.
RAGAS The RAG metric library Open-source eval framework with 4 reference-free RAG metrics.
Faithfulness Is the answer grounded? Fraction of answer claims entailed by retrieved context.
Context precision Were retrieved chunks relevant? Fraction of top-K chunks that actually mattered.
Context recall Did retrieval find everything? Fraction of gold-answer claims supported by retrieved chunks.
G-Eval Custom LLM judge Rubric + chain-of-thought eval steps + 0-1 score.
Calibration Trust but verify Spearman correlation between judge score and human score.

Further Reading