AI Platform Engineering & MLOps · Part XVIII of 34

Eval as a Test Suite: LLM-as-Judge in CI Without Flaky Merges

Why BLEU and ROUGE fail LLM systems, how LLM-as-judge works, and how to build a deterministic CI gate from probabilistic scores.


[Figure: judge run (1 of N); score variance band; gate: mean − 1σ ≥ 0.80; flaky lane blocked]

Classical software tests are deterministic: the same input produces the same output, and the test either passes or fails. Classical ML evaluation is almost as clean: you hold out a labelled test set, run the model, and compute accuracy, F1, or BLEU against known ground truth. Neither model applies cleanly to a large language model system. The inputs are natural language, the expected outputs are often rubrics rather than exact strings, and the evaluation itself, if done honestly, relies on another model making a probabilistic judgement. This article explains the pattern that has emerged to handle that reality: treating LLM evaluation as a version-controlled test suite that runs in CI on every pull request, with a deterministic pass/fail gate derived from probabilistic scores.

The article assumes you have a working understanding of prompt versioning and the MLOps-to-LLMOps shift covered earlier in this series. The focus here is purely on evaluation infrastructure: what to measure, how to measure it reliably, and how to wire the result into a merge gate.

Why classical metrics fall short

BLEU and ROUGE were designed for tasks where high-quality candidates share many exact token-level matches with reference outputs, such as machine translation and extractive summarisation. A 2022 survey of NLG evaluation metrics found that text-overlap metrics show weak or no correlation with human judgements in open-domain natural language generation tasks [1]. They cannot verify factual accuracy, instruction-following, or tone. A model that paraphrases a correct answer in different words receives a low BLEU score; a model that hallucinates in words that happen to overlap with the reference may score acceptably. For LLM systems, where the tasks are precisely the open-ended ones that BLEU handles worst, overlap metrics are at best weak proxies.
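The paraphrase-versus-hallucination failure can be reproduced with a few lines of Python. This is a minimal sketch that uses a plain unigram-overlap F1 as a simplified stand-in for ROUGE-1; it is not the official BLEU or ROUGE implementation, and the three example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (simplified stand-in for ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (assumption, not from any benchmark).
reference = "the refund was issued on 3 march and takes five days to arrive"
paraphrase = "we returned your payment on march 3rd, so expect it within five working days"
hallucination = "the refund was issued on 9 june and takes five weeks to arrive"

# The faithful paraphrase shares few tokens with the reference;
# the answer with the wrong date and duration shares most of them.
print("paraphrase:   ", round(unigram_f1(paraphrase, reference), 2))
print("hallucination:", round(unigram_f1(hallucination, reference), 2))
```

The factually wrong answer outscores the correct paraphrase by a wide margin, which is the behaviour that makes overlap metrics unsuitable as a merge gate for open-ended tasks.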

The HELM benchmark (Holistic Evaluation of Language Models) made this multi-dimensional problem explicit by evaluating 30 models across 42 scenarios using 7 distinct metric categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency [2]. Its central finding was that single-metric evaluation consistently misses critical dimensions: a model that tops an accuracy leaderboard may rank poorly on calibration or toxicity. If your test suite only measures task accuracy, you are optimising for one dimension of a seven-dimensional space.

The LLM-as-judge pattern

The dominant response has been to use a separate, capable language model as the judge. The judge receives a rubric, the test input, and the model’s output, and returns a score or preference. Zheng et al. (NeurIPS 2023) evaluated this approach on MT-Bench and Chatbot Arena across 3,300 human annotations and found that GPT-4-as-judge achieves over 80% agreement with human raters, the same level of agreement observed between human annotators rating the same outputs [3]. That result established LLM-as-judge as a credible evaluation method for tasks where human annotation is the gold standard but too expensive to run at CI frequency.

The same paper characterised three systematic biases that practitioners need to mitigate:

Position bias: when two outputs are presented for comparison, the judge favours whichever appears first.
Verbosity bias: longer responses receive higher scores independent of quality.
Self-enhancement bias: a model used as its own judge scores its own outputs higher.

Mitigations are practical rather than perfect. Position bias is reduced by swapping the order of compared outputs and averaging. Verbosity bias is reduced by using direct scoring rubrics (“rate this response 1–5 on accuracy”) rather than comparative preference. Self-enhancement bias is avoided by using a different model as the judge than the one under test, which is the correct architectural pattern in any case. None of these mitigations eliminates variance entirely, which is why the gate logic below treats scores as distributions, not point estimates.
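The swap-and-average mitigation for position bias can be sketched as follows. The judge signature here is an assumption made for illustration: a callable that takes the prompt and two responses and returns its preference for whichever response is shown first, as a value in [0, 1]. The stub judge is deliberately biased so the effect is visible.

```python
from typing import Callable

# Hypothetical signature: preference for the FIRST response shown, in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairwiseJudge, prompt: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_shown_first = judge(prompt, a, b)         # a occupies the first slot
    a_shown_second = 1.0 - judge(prompt, b, a)  # b was scored, so invert
    return (a_shown_first + a_shown_second) / 2


def first_slot_biased_judge(prompt: str, first: str, second: str) -> float:
    """Stub judge that leans 0.7 toward whatever is shown first."""
    return 0.7


# A judge with pure position bias and no real preference averages out to 0.5.
print(debiased_preference(first_slot_biased_judge, "q", "answer A", "answer B"))
```

Swapping doubles the number of judge calls per comparison, so it compounds with any repeated-run strategy used to control variance.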

Structuring the eval as a version-controlled test suite

The pattern maps directly to the structure of a software test suite. Three artefacts are committed to version control alongside the model configuration or prompt template they evaluate:

1. Test cases. Input–expected-behaviour pairs. For LLM tasks, "expected behaviour" is a rubric: "the response should answer the user’s question without fabricating a source" is a valid expected behaviour. Exact-string expected outputs are only appropriate for narrow, structured tasks.
2. Judge prompts. The system prompt and user template fed to the judge model. These are configuration, not code, and they must be versioned with the same discipline as the model configuration they evaluate.
3. Thresholds. The numeric pass/fail boundaries. Stored as code so that a threshold change is a pull request, reviewable and attributable.

This structure means that a change to a prompt template, a model version bump, or a threshold relaxation all produce a diff in version control and trigger CI. The evaluation suite is not a one-off script or a production dashboard — it is the test harness for the model layer of the system.
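One way to picture the three artefacts is as plain data checked in next to the prompt they evaluate. The sketch below is illustrative only: the class name, template wording, and threshold keys are hypothetical and not taken from any particular framework.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    rubric: str  # expected behaviour, stated as a rubric rather than a string


# Hypothetical judge template using direct 1-5 scoring.
JUDGE_TEMPLATE = (
    "You are grading a response against a rubric.\n"
    "Rubric: {rubric}\n"
    "User input: {user_input}\n"
    "Response: {response}\n"
    "Rate the response from 1 to 5 on how well it satisfies the rubric. "
    "Reply with the number only."
)

# Hypothetical threshold names; changing a value shows up as a reviewable diff.
THRESHOLDS = {"min_lower_bound": 0.80, "max_regression": 0.05}


def render_judge_prompt(case: EvalCase, response: str) -> str:
    """Fill the judge template for one test case and one model response."""
    return JUDGE_TEMPLATE.format(
        rubric=case.rubric, user_input=case.user_input, response=response
    )


def judge_prompt_version() -> str:
    """Short content hash, so archived scores can name the exact template used."""
    return hashlib.sha256(JUDGE_TEMPLATE.encode("utf-8")).hexdigest()[:12]


case = EvalCase(
    case_id="refund-001",
    user_input="When will my refund arrive?",
    rubric="Answers the question without fabricating a source.",
)
print(judge_prompt_version())
print(render_judge_prompt(case, "Within five working days."))
```

Hashing the template is a cheap way to tie every archived score to the exact judge prompt that produced it, which matters once the prompt itself starts changing across pull requests.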

Four frameworks and what each is good for

Promptfoo is the most CI-native option. Test cases are defined in YAML, evaluators include LLM judges, regex, and code-based assertions, and the CLI exits with code 100 when the pass-rate falls below the configured threshold, a first-class signal for CI failure. It installs as an npm package with no Kubernetes dependency. Its limitation is that it does not emit confidence intervals as a first-class output [4]: a single mean score per test case is what you get.

DeepEval integrates with pytest, which means LLM evaluation becomes a first-class Python test. It ships purpose-built metrics: G-Eval for LLM-judge scoring, faithfulness and contextual precision/recall for RAG pipelines, and hallucination detection. The pytest integration means it slots into any Python CI pipeline without additional configuration. The RAG metrics are its distinguishing feature: teams evaluating retrieval-augmented generation benefit from measuring retrieval quality and generation quality separately [5].

Inspect AI, released by the UK AI Security Institute, follows a task/solver/scorer pattern designed for safety and capability evaluation. Tasks are composable Python objects; solvers define how the model approaches a task; scorers measure the result. It was built for reproducibility and auditability — each evaluation run produces a detailed log that can be archived as a compliance artefact. Several frontier-model evaluation programmes have adopted it for precisely that reason [6].

LangSmith provides managed evaluation alongside tracing and prompt versioning. Test datasets are maintained in the LangSmith UI; evaluation runs are triggered from the SDK. Its integration with the LangChain tracing ecosystem means that the same trace that captures a production failure can seed a regression test. The relevant constraint for platform engineers is deployment model: LangSmith is a SaaS product, and as of mid-2026 the self-hosted option is in limited preview. Teams operating in regulated or network-isolated environments need a self-hosted alternative [7].

Managed platforms (Patronus AI, Galileo) are a separate category: they handle the infrastructure cost of running judges at scale and provide confidence-interval-aware scoring as a managed service. They are the right fit for teams where the operational overhead of self-hosted judge infrastructure is material, and where SaaS data residency constraints are acceptable.

A worked CI example

A developer opens a pull request that changes a prompt template. The CI pipeline must evaluate the changed template against the current production template and block the merge if the new version regresses. Here is the minimal pipeline structure:

eval-stage.yaml (CI pipeline excerpt)

eval:
  stage: test
  script:
    # Run evaluations against the changed prompt config.
    # The --threshold flag sets the minimum aggregate pass rate.
    # The harness exits with code 100 if the rate falls below it.
    - npx promptfoo eval --config eval/prompt-v2.yaml --threshold 0.85
    # Regression gate: the new version must not score lower than
    # production on the capability benchmark.
    - |
      python eval/compare_to_production.py \
        --baseline eval/scores/production.json \
        --candidate eval/scores/candidate.json \
        --max-regression 0.05
  artifacts:
    paths:
      - eval/scores/
    expire_in: 90 days
  rules:
    - changes:
        - prompts/**
        - model-config/**

The pass/fail rule is explicit and stated in the pipeline configuration itself, not in a shared spreadsheet: the aggregate pass rate must be at least 85%, and the new version must not score more than 0.05 (five points on a 0–1 scale) below the production baseline. Both numbers are reviewable in the pull request diff. The evaluation artefacts are archived, not only for debugging but also as a compliance record if the model is subject to governance requirements.
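The pipeline calls a comparison script whose contents the article does not show. A minimal sketch of what such a script might do is below. It assumes, purely for illustration, that each score file is a flat JSON object mapping a benchmark name to a mean score on a 0–1 scale, and that the regression limit is an absolute difference; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def check_regression(baseline: dict, candidate: dict, max_regression: float) -> List[str]:
    """Return benchmark names where the candidate is missing or falls more
    than `max_regression` below the baseline score."""
    failures = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or cand_score < base_score - max_regression:
            failures.append(name)
    return failures


def main(baseline_path: str, candidate_path: str, max_regression: float) -> int:
    """Load both score files and return a process exit code."""
    baseline = json.loads(Path(baseline_path).read_text())
    candidate = json.loads(Path(candidate_path).read_text())
    failures = check_regression(baseline, candidate, max_regression)
    for name in failures:
        print(f"REGRESSION {name}: baseline={baseline[name]} candidate={candidate.get(name)}")
    return 1 if failures else 0  # a non-zero exit code fails the CI job


# In-memory example: "summarise" drops by 0.10 and is flagged; "qa" is within tolerance.
print(check_regression({"qa": 0.90, "summarise": 0.80},
                       {"qa": 0.88, "summarise": 0.70},
                       max_regression=0.05))
```

A benchmark that is missing from the candidate file is treated as a failure rather than skipped, so deleting a test cannot silently pass the gate. Wiring `main` to command-line flags (for example with `argparse`) is left out for brevity.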

Deterministic pass/fail from probabilistic scores

The structural challenge with LLM-as-judge is that scores are probabilistic. The same input, the same judge, and the same rubric can return 0.9 on one run and 0.7 on another. A CI gate that reads a single score and compares it to a threshold will produce flaky failures — the merge gate breaks on a particular run, not on a genuine capability regression.

The standard mitigation is to run each test case through the judge N times (typically 3–5) and gate on a statistic derived from the distribution rather than on a single sample. The conservative gate uses the lower bound of a one-standard-deviation interval:

eval_ci.py

import statistics

def evaluate_with_ci(judge_fn, test_case, n_runs=5, threshold=0.80):
    """
    Run the judge N times. Gate on mean - 1 std >= threshold.
    A model that scores 0.9 on four runs and 0.2 on one run
    has mean=0.76 and std=0.31; lower_bound=0.45, so it correctly fails.
    """
    scores = [judge_fn(test_case) for _ in range(n_runs)]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    lower_bound = mean - std
    return {
        "mean": round(mean, 3),
        "std": round(std, 3),
        "lower_bound": round(lower_bound, 3),
        "passed": lower_bound >= threshold,
        "scores": scores,
    }

The key insight is that gating on the mean alone is insufficient. A judge that returns 0.9 on four runs and 0.2 on a fifth has a mean of 0.76 and a sample standard deviation of about 0.31. Here the mean alone already fails a 0.80 threshold, but only narrowly: with a milder outlier of 0.5 the mean rises to 0.82 and a mean-only gate passes, even though one run in five is still far from the rest. Gating on mean minus one standard deviation surfaces that instability and correctly fails the gate in both cases (lower bounds of roughly 0.45 and 0.64). The cost is additional inference: 5 runs instead of 1 means roughly 5× the judge inference cost for each test case, which should be accounted for in CI budget planning.
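The difference between a mean-only gate and a mean-minus-one-standard-deviation gate is easiest to see on concrete score lists. The sketch below uses only the standard library; the three score lists are invented for illustration, and `summarise` is a hypothetical helper rather than part of eval_ci.py.

```python
import statistics


def summarise(scores, threshold=0.80):
    """Compare a mean-only gate with a mean - 1 std gate on one score list."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation
    return {
        "mean": round(mean, 3),
        "std": round(std, 3),
        "lower_bound": round(mean - std, 3),
        "mean_gate": mean >= threshold,
        "lower_bound_gate": (mean - std) >= threshold,
    }


# Invented score lists for illustration.
one_bad_run = [0.9, 0.9, 0.9, 0.9, 0.2]       # fails both gates
milder_outlier = [0.9, 0.9, 0.9, 0.9, 0.5]    # passes mean gate, fails lower-bound gate
steady = [0.86, 0.84, 0.85, 0.86, 0.84]       # passes both gates

for label, scores in [("one_bad_run", one_bad_run),
                      ("milder_outlier", milder_outlier),
                      ("steady", steady)]:
    print(label, summarise(scores))
```

The middle case is the one that matters: its mean of 0.82 clears the threshold, yet its lower bound of about 0.64 does not, so only the distribution-aware gate blocks it.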

The simulator below builds the identical pull request twenty times under each gate rule. Watch how a single-sample gate turns judge variance into merge-gate flakiness, and how the mean − 1σ rule makes the verdict deterministic.

[Interactive component: Flaky Merge Simulator. The same PR is built 20 times against a fixed 0.80 threshold. Controls: candidate quality, judge variance, judge runs per test case (N), and gate rule. Example output: with true quality 0.90 and judge σ 0.03, 20 of 20 builds pass (average run mean 0.902, average per-build σ 0.032), so every re-run reaches the same verdict.]

Averaging N runs damps the noise, but a high-variance judge can still tip a borderline mean across the threshold between builds.

The confidence interval gap in OSS tooling

As of mid-2026, no widely adopted open-source LLM evaluation harness emits confidence intervals as a first-class output. Promptfoo reports a mean pass rate per test run; DeepEval reports metric scores per test case in a single pass. Neither tool provides a native N-runs-with-confidence-interval output that a CI gate can consume directly. The workaround described above, running the judge N times and computing the distribution in application code, is currently the standard pattern for teams that need this reliability guarantee. It is a gap in the tooling ecosystem, not a deliberate design choice, and it is worth watching: the managed platforms (Patronus, Galileo) have implemented confidence-interval-aware scoring, and the gap may close in OSS tooling as the pattern becomes more standard.

Pre-production gate structure

A production-grade LLM evaluation gate has four components that are distinct in purpose and should not be collapsed into a single pass-rate metric:

Regression tests. Fixed input → expected output or expected behaviour. Any failure here is a hard blocker — the model can no longer perform a task it previously performed correctly. These should be deterministic where possible: exact-string or structured-output assertions that do not require a judge.
Capability benchmark. The new version must score at or above the current production version’s mean score on a defined task set, within a stated tolerance. Allowing a shortfall of up to 5% is common to absorb judge variance without masking real regressions. This gate prevents silent capability degradation.
Safety evaluation. For user-facing prompts: an adversarial probe set covering jailbreak attempts, boundary-pushing inputs, and refusal behaviour. This is domain-specific and requires human-curated test cases — it cannot be generated automatically without defeating the purpose.
Latency and cost budget. The new version must not exceed the production SLA. Token usage and time-to-first-token are measured against a representative traffic sample in the same CI run. A model change that improves capability scores but doubles token cost fails this gate.

The four components answer four different questions: did anything break, did anything get worse, is it safe, and can we afford it. A single pass-rate number cannot answer all four — which is why collapsing them is the most common way production eval gates quietly stop gating.
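Keeping the four gates separate can be sketched as four independent checks whose results are reported individually and combined only with a logical AND. The metric and budget key names below are hypothetical, as are the example numbers; the point is the shape, not the schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def run_gates(metrics: dict, budgets: dict) -> List[GateResult]:
    """Evaluate the four gates independently; no gate can offset another."""
    floor = metrics["baseline_score"] - budgets["capability_tolerance"]
    within_budget = (
        metrics["ttft_ms"] <= budgets["ttft_ms"]
        and metrics["tokens_per_request"] <= budgets["tokens_per_request"]
    )
    return [
        GateResult("regression", metrics["regression_failures"] == 0,
                   f"{metrics['regression_failures']} fixed cases failed"),
        GateResult("capability", metrics["candidate_score"] >= floor,
                   f"candidate {metrics['candidate_score']} vs floor {floor:.2f}"),
        GateResult("safety", metrics["safety_failures"] == 0,
                   f"{metrics['safety_failures']} adversarial probes got through"),
        GateResult("budget", within_budget,
                   f"ttft {metrics['ttft_ms']} ms, "
                   f"{metrics['tokens_per_request']} tokens/request"),
    ]


def merge_allowed(results: List[GateResult]) -> bool:
    """The merge proceeds only if every gate passed."""
    return all(result.passed for result in results)


# Example: capability improves, but token usage doubles past the budget.
results = run_gates(
    metrics={"regression_failures": 0, "candidate_score": 0.91, "baseline_score": 0.86,
             "safety_failures": 0, "ttft_ms": 420, "tokens_per_request": 1800},
    budgets={"capability_tolerance": 0.05, "ttft_ms": 500, "tokens_per_request": 900},
)
for result in results:
    print(("PASS" if result.passed else "FAIL"), result.name, "-", result.detail)
print("merge allowed:", merge_allowed(results))
```

Because each gate reports its own result, a failed merge states which question went unanswered, and a strong capability score cannot compensate for a blown budget or a safety failure.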

The builder below assembles the four merge-blocking gates and the four judge-reliability practices from this article. Toggle any component off to see exactly what leaks through to production without it.

[Interactive component: Eval Gate Builder. Toggles for the four merge-blocking gates and the four judge reliability and governance practices. With all eight enabled (gate readiness 8 / 8), regressions, capability, safety, and budget are all gated, the judge scores are trustworthy and auditable, and nothing leaks through to production.]

Deployment context considerations

In regulated environments, the entire eval harness must run on infrastructure the organisation controls. The judge model must be self-hosted: a capable open-weights model served via a local inference server (e.g. vLLM, TGI, or a comparable inference runtime). SaaS eval platforms are excluded by data residency and network isolation requirements.

The evaluation artefacts — test cases, judge prompts, per-run scores, and pass/fail decisions — must be archived as compliance records. In AI governance frameworks such as the EU AI Act, evidence of pre-deployment testing is a required component of the conformity assessment for high-risk AI systems. Treating the CI eval artefacts as ephemeral is a governance risk. The CI pipeline should be configured to retain them for the duration required by the applicable regulatory framework.
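An archivable record for one eval run might look like the sketch below. The schema is hypothetical and is not drawn from any regulation or framework; which fields a given governance regime actually requires is a question for the applicable guidance. The digest is one simple way to detect later modification of a stored record.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_eval_record(commit_sha: str, judge_model: str, judge_prompt: str,
                      case_results: List[dict],
                      created_at: Optional[str] = None) -> dict:
    """Assemble one archivable record for a CI eval run (hypothetical schema)."""
    return {
        "commit_sha": commit_sha,
        "judge_model": judge_model,
        # Store a hash so the exact judge prompt text is identifiable later.
        "judge_prompt_sha256": hashlib.sha256(judge_prompt.encode("utf-8")).hexdigest(),
        "created_at": created_at or datetime.now(timezone.utc).isoformat(),
        "cases": case_results,  # per-case results, including every judge run
        "passed": all(case["passed"] for case in case_results),
    }


def record_digest(record: dict) -> str:
    """SHA-256 over a canonical JSON form of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
record = build_eval_record(
    commit_sha="abc1234",
    judge_model="local-open-weights-judge",
    judge_prompt="Rate the response from 1 to 5 against the rubric.",
    case_results=[
        {"case_id": "refund-001", "scores": [0.9, 0.9, 0.85], "passed": True},
        {"case_id": "refund-002", "scores": [0.9, 0.4, 0.9], "passed": False},
    ],
    created_at="2026-01-01T00:00:00+00:00",
)
print(json.dumps(record, indent=2))
print(record_digest(record))
```

Writing the record and its digest to the retained CI artefact directory keeps the per-run scores, the judge identity, and the verdict together in one place.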

References

[1] Sai, A. B., Mohankumar, A. K., and Khapra, M. M. “A Survey of Evaluation Metrics Used for NLG Systems.” ACM Computing Surveys, 2022. arXiv:2008.12009. arxiv.org/abs/2008.12009
[2] Liang, P., Bommasani, R., Lee, T., et al. “Holistic Evaluation of Language Models.” arXiv:2211.09110, 2022. arxiv.org/abs/2211.09110
[3] Zheng, L., Chiang, W.-L., Sheng, Y., et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2306.05685. arxiv.org/abs/2306.05685
[4] Promptfoo. CI/CD Integration documentation. promptfoo.dev, 2024. (GitHub: github.com/promptfoo/promptfoo) promptfoo.dev/docs/integrations/ci-cd
[5] DeepEval. Metrics documentation. confident-ai.com, 2024. (GitHub: github.com/confident-ai/deepeval) deepeval.com/docs/metrics-introduction
[6] UK AI Security Institute. Inspect AI: A framework for large language model evaluations. aisi.gov.uk, 2024. (GitHub: github.com/UKGovernmentBEIS/inspect_ai) inspect.aisi.org.uk
[7] LangChain. LangSmith documentation. smith.langchain.com, 2024. docs.smith.langchain.com
