Eval as a Test Suite: LLM-as-Judge in CI Without Flaky Merges

·9 min read·asleekgeek
A terminal window showing a CI pipeline running LLM evaluation tests with pass/fail results

LLM evaluation running as a first-class CI stage

Classical software tests are deterministic: the same input produces the same output, and the test either passes or fails. Classical ML evaluation is almost as clean: you hold out a labelled test set, run the model, and compute accuracy, F1, or BLEU against known ground truth. Neither model applies cleanly to a large language model system. The inputs are natural language, the expected outputs are often rubrics rather than exact strings, and the evaluation itself — if done honestly — relies on another model making a probabilistic judgement. This article explains the pattern that has emerged to handle that reality: treating LLM evaluation as a version-controlled test suite that runs in CI on every pull request, with a deterministic pass/fail gate derived from probabilistic scores.

The article assumes you have a working understanding of prompt versioning and the MLOps-to-LLMOps shift covered earlier in this series. The focus here is purely on evaluation infrastructure: what to measure, how to measure it reliably, and how to wire the result into a merge gate.

Why classical metrics fall short

BLEU and ROUGE were designed for tasks where quality candidates share many exact token-level matches with reference outputs — machine translation and extractive summarisation. A 2022 survey of NLG evaluation metrics found that text-overlap metrics show weak or no correlation with human judgements in open-domain natural language generation tasks [1]. They cannot verify factual accuracy, instruction-following, or tone. A model that paraphrases a correct answer in different words receives a low BLEU score; a model that hallucinates in words that happen to overlap with the reference may score acceptably. For LLM systems — where the tasks are precisely the open-ended ones that BLEU handles worst — overlap metrics are at best weak proxies.

The HELM benchmark (Holistic Evaluation of Language Models) made this multi-dimensional problem explicit by evaluating 30 models across 42 scenarios using 7 distinct metric categories — accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency [2]. Its central finding was that single-metric evaluation consistently misses critical dimensions: a model that tops an accuracy leaderboard may rank poorly on calibration or toxicity. If your test suite only measures task accuracy, you are optimising for one dimension of a seven-dimensional space.

The LLM-as-judge pattern

The dominant response has been to use a separate, capable language model as the judge. The judge receives a rubric, the test input, and the model's output, and returns a score or preference. Zheng et al. (NeurIPS 2023) evaluated this approach on MT-Bench and Chatbot Arena across 3,300 human annotations and found that GPT-4-as-judge achieves over 80% agreement with human raters — the same level of agreement observed between human annotators rating the same outputs [3]. That result established LLM-as-judge as a credible evaluation method for tasks where human annotation is the gold standard but too expensive to run at CI frequency.

The same paper characterised three systematic biases that practitioners need to mitigate:

  • Position bias: when two outputs are presented for comparison, the judge favours whichever appears first.
  • Verbosity bias: longer responses receive higher scores independent of quality.
  • Self-enhancement bias: a model used as its own judge scores its own outputs higher.

Mitigations are practical rather than perfect. Position bias is reduced by swapping the order of compared outputs and averaging. Verbosity bias is reduced by using direct scoring rubrics ("rate this response 1–5 on accuracy") rather than comparative preference. Self-enhancement bias is eliminated by using a different model as the judge than the one under test — which is the correct architectural pattern in any case. None of these mitigations eliminates variance entirely, which is why the gate logic below treats scores as distributions, not point estimates.

Structuring the eval as a version-controlled test suite

The pattern maps directly to the structure of a software test suite. Three artefacts are committed to version control alongside the model configuration or prompt template they evaluate:

  1. Test cases. Input–expected-behaviour pairs. For LLM tasks, "expected behaviour" is a rubric: "the response should answer the user's question without fabricating a source" is a valid expected behaviour. Exact-string expected outputs are only appropriate for narrow, structured tasks.
  2. Judge prompts. The system prompt and user template fed to the judge model. These are configuration, not code — they must be versioned with the same discipline as the model configuration they evaluate.
  3. Thresholds. The numeric pass/fail boundaries. Stored as code so that a threshold change is a pull request, reviewable and attributable.

This structure means that a change to a prompt template, a model version bump, or a threshold relaxation all produce a diff in version control and trigger CI. The evaluation suite is not a one-off script or a production dashboard — it is the test harness for the model layer of the system.

Four OSS frameworks and what each is good for

Promptfoo is the most CI-native option. Test cases are defined in YAML, evaluators include LLM judges, regex, and code-based assertions, and the CLI exits with code 100 when the pass-rate falls below the configured threshold — a first-class signal for CI failure. It installs as an npm package with no Kubernetes dependency. Its limitation is that it does not emit confidence intervals as a first-class output [4]: a single mean score per test case is what you get.

DeepEval integrates with pytest, which means LLM evaluation becomes a first-class Python test. It ships purpose-built metrics: G-Eval for LLM-judge scoring, faithfulness and contextual precision/recall for RAG pipelines, and hallucination detection. The pytest integration means it slots into any Python CI pipeline without additional configuration. The RAG metrics are its distinguishing feature — teams evaluating retrieval-augmented generation benefit from measuring retrieval quality and generation quality separately [5].

Inspect AI, released by the UK AI Security Institute, follows a task/solver/scorer pattern designed for safety and capability evaluation. Tasks are composable Python objects; solvers define how the model approaches a task; scorers measure the result. It was built for reproducibility and auditability — each evaluation run produces a detailed log that can be archived as a compliance artefact. Several frontier-model evaluation programmes have adopted it for precisely that reason [6].

LangSmith provides managed evaluation alongside tracing and prompt versioning. Test datasets are maintained in the LangSmith UI; evaluation runs are triggered from the SDK. Its integration with the LangChain tracing ecosystem means that the same trace that captures a production failure can seed a regression test. The relevant constraint for platform engineers is deployment model: LangSmith is a SaaS product, and as of mid-2026 the self-hosted option is in limited preview. Teams operating in regulated or network-isolated environments need a self-hosted alternative [7].

Managed platforms (Patronus AI, Galileo) are a fourth category: they handle the infrastructure cost of running judges at scale and provide confidence-interval-aware scoring as a managed service. They are the right fit for teams where the operational overhead of self-hosted judge infrastructure is material, and where SaaS data residency constraints are acceptable.

A worked CI example

A developer opens a pull request that changes a prompt template. The CI pipeline must evaluate the changed template against the current production template and block the merge if the new version regresses. Here is the minimal pipeline structure:

eval-stage.yaml (CI pipeline excerpt)
eval:
  stage: test
  script:
    # Run evaluations against the changed prompt config.
    # The --threshold flag sets the minimum aggregate pass rate.
    # The harness exits with code 100 if the rate falls below it.
    - npx promptfoo eval --config eval/prompt-v2.yaml --threshold 0.85
    # Regression gate: the new version must not score lower than
    # production on the capability benchmark.
    - python eval/compare_to_production.py \
        --baseline eval/scores/production.json \
        --candidate eval/scores/candidate.json \
        --max-regression 0.05
  artifacts:
    paths:
      - eval/scores/
    expire_in: 90 days
  rules:
    - changes:
        - prompts/**
        - model-config/**

The pass/fail rule is explicit and stated in the pipeline configuration itself, not in a shared spreadsheet: the aggregate pass rate must be at least 85%, and the new version must not regress more than 5% below the production baseline. Both numbers are reviewable in the pull request diff. The evaluation artefacts are archived — not only for debugging but as a compliance record if the model is subject to governance requirements.

Deterministic pass/fail from probabilistic scores

The structural challenge with LLM-as-judge is that scores are probabilistic. The same input, the same judge, and the same rubric can return 0.9 on one run and 0.7 on another. A CI gate that reads a single score and compares it to a threshold will produce flaky failures — the merge gate breaks on a particular run, not on a genuine capability regression.

The standard mitigation is to run each test case through the judge N times (typically 3–5) and gate on a statistic derived from the distribution rather than on a single sample. The conservative gate uses the lower bound of a one-standard-deviation interval:

eval_ci.py
import statistics

def evaluate_with_ci(judge_fn, test_case, n_runs=5, threshold=0.80):
    """
    Run the judge N times. Gate on mean - 1 std >= threshold.
    A model that scores 0.9 on four runs and 0.2 on one run
    has mean=0.78 and std=0.31; lower_bound=0.47 — correctly fails.
    """
    scores = [judge_fn(test_case) for _ in range(n_runs)]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    lower_bound = mean - std
    return {
        "mean": round(mean, 3),
        "std": round(std, 3),
        "lower_bound": round(lower_bound, 3),
        "passed": lower_bound >= threshold,
        "scores": scores,
    }

The key insight is that gating on the mean alone is insufficient. A judge that returns 0.9 on four runs and 0.2 on a fifth has a mean of 0.78 — which clears an 0.8 threshold by a narrow margin that obscures real variance. Gating on mean minus one standard deviation surfaces that instability and correctly fails the gate. The cost is additional inference: 5 runs instead of 1 means roughly 5× the judge inference cost for each test case, which should be accounted for in CI budget planning.

The confidence interval gap in OSS tooling

As of mid-2026, no widely adopted open-source LLM evaluation harness emits confidence intervals as a first-class output. Promptfoo reports a mean pass rate per test run; DeepEval reports metric scores per test case in a single pass. Neither tool provides a native N-runs-with-CI output that a CI gate can consume directly. The workaround described above — running the judge N times and computing the distribution in application code — is currently the standard pattern for teams that need this reliability guarantee. It is a gap in the tooling ecosystem, not a deliberate design choice, and it is worth watching: the managed platforms (Patronus, Galileo) have implemented CI-aware scoring, and the gap may close in OSS tooling as the pattern becomes more standard.

Pre-production gate structure

A production-grade LLM evaluation gate has four components that are distinct in purpose and should not be collapsed into a single pass-rate metric:

  • Regression tests. Fixed input → expected output or expected behaviour. Any failure here is a hard blocker — the model can no longer perform a task it previously performed correctly. These should be deterministic where possible: exact-string or structured-output assertions that do not require a judge.
  • Capability benchmark. The new version must score at or above the current production version's mean score on a defined task set. A tolerance of ±5% is common to absorb judge variance without masking real regressions. This gate prevents silent capability degradation.
  • Safety evaluation. For user-facing prompts: an adversarial probe set covering jailbreak attempts, boundary-pushing inputs, and refusal behaviour. This is domain-specific and requires human-curated test cases — it cannot be generated automatically without defeating the purpose.
  • Latency and cost budget. The new version must not exceed the production SLA. Token usage and time-to-first-token are measured against a representative traffic sample in the same CI run. A model change that improves capability scores but doubles token cost fails this gate.

Deployment context considerations

In regulated environments, the entire eval harness must run on infrastructure the organisation controls. The judge model must be self-hosted — a capable open-weights model served via a local inference server (e.g. vLLM, TGI, or a comparable inference runtime). SaaS eval platforms are excluded by data residency and network isolation requirements.

The evaluation artefacts — test cases, judge prompts, per-run scores, and pass/fail decisions — must be archived as compliance records. In AI governance frameworks such as the EU AI Act, evidence of pre-deployment testing is a required component of the conformity assessment for high-risk AI systems. Treating the CI eval artefacts as ephemeral is a governance risk. The CI pipeline should be configured to retain them for the duration required by the applicable regulatory framework.

References

  1. Sai, A. B., Mohankumar, A. K., and Khapra, M. M. "A Survey of Evaluation Metrics Used for NLG Systems." ACM Computing Surveys, 2022. arXiv:2008.12009.
  2. Liang, P., Bommasani, R., Lee, T., et al. "Holistic Evaluation of Language Models." arXiv:2211.09110, 2022.
  3. Zheng, L., Chiang, W.-L., Sheng, Y., et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2306.05685.
  4. Promptfoo. CI/CD Integration documentation. promptfoo.dev, 2024. (GitHub: github.com/promptfoo/promptfoo)
  5. DeepEval. Metrics documentation. confident-ai.com, 2024. (GitHub: github.com/confident-ai/deepeval)
  6. UK AI Security Institute. Inspect AI: A framework for large language model evaluations. aisi.gov.uk, 2024. (GitHub: github.com/UKGovernmentBEIS/inspect_ai)
  7. LangChain. LangSmith documentation. smith.langchain.com, 2024.

Tags

#eval#llm-as-judge#ci#series:ai-platform-mlops#series-order/18

About the Author

asleekgeek

asleekgeek

Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist

Thanks for reading! Explore more articles.

Back to Articles