Eval as a Test Suite: LLM-as-Judge in CI Without Flaky Merges

LLM evaluation running as a first-class CI stage
Classical software tests are deterministic: the same input produces the same output, and the test either passes or fails. Classical ML evaluation is almost as clean: you hold out a labelled test set, run the model, and compute accuracy, F1, or BLEU against known ground truth. Neither model applies cleanly to a large language model system. The inputs are natural language, the expected outputs are often rubrics rather than exact strings, and the evaluation itself — if done honestly — relies on another model making a probabilistic judgement. This article explains the pattern that has emerged to handle that reality: treating LLM evaluation as a version-controlled test suite that runs in CI on every pull request, with a deterministic pass/fail gate derived from probabilistic scores.
The article assumes you have a working understanding of prompt versioning and the MLOps-to-LLMOps shift covered earlier in this series. The focus here is purely on evaluation infrastructure: what to measure, how to measure it reliably, and how to wire the result into a merge gate.
Why classical metrics fall short
The LLM-as-judge pattern
The same paper characterised three systematic biases that practitioners need to mitigate:
- Position bias: when two outputs are presented for comparison, the judge favours whichever appears first.
- Verbosity bias: longer responses receive higher scores independent of quality.
- Self-enhancement bias: a model used as its own judge scores its own outputs higher.
Mitigations are practical rather than perfect. Position bias is reduced by swapping the order of compared outputs and averaging. Verbosity bias is reduced by using direct scoring rubrics ("rate this response 1–5 on accuracy") rather than comparative preference. Self-enhancement bias is eliminated by using a different model as the judge than the one under test — which is the correct architectural pattern in any case. None of these mitigations eliminates variance entirely, which is why the gate logic below treats scores as distributions, not point estimates.
Structuring the eval as a version-controlled test suite
The pattern maps directly to the structure of a software test suite. Three artefacts are committed to version control alongside the model configuration or prompt template they evaluate:
- Test cases. Input–expected-behaviour pairs. For LLM tasks, "expected behaviour" is a rubric: "the response should answer the user's question without fabricating a source" is a valid expected behaviour. Exact-string expected outputs are only appropriate for narrow, structured tasks.
- Judge prompts. The system prompt and user template fed to the judge model. These are configuration, not code — they must be versioned with the same discipline as the model configuration they evaluate.
- Thresholds. The numeric pass/fail boundaries. Stored as code so that a threshold change is a pull request, reviewable and attributable.
This structure means that a change to a prompt template, a model version bump, or a threshold relaxation all produce a diff in version control and trigger CI. The evaluation suite is not a one-off script or a production dashboard — it is the test harness for the model layer of the system.
Four OSS frameworks and what each is good for
Promptfoo is the most CI-native option. Test cases are defined in YAML, evaluators include LLM judges, regex, and code-based assertions, and the CLI exits with code 100 when the pass-rate falls below the configured threshold — a first-class signal for CI failure. It installs as an npm package with no Kubernetes dependency. Its limitation is that it does not emit confidence intervals as a first-class output [4]: a single mean score per test case is what you get.
DeepEval integrates with pytest, which means LLM evaluation becomes a first-class Python test. It ships purpose-built metrics: G-Eval for LLM-judge scoring, faithfulness and contextual precision/recall for RAG pipelines, and hallucination detection. The pytest integration means it slots into any Python CI pipeline without additional configuration. The RAG metrics are its distinguishing feature — teams evaluating retrieval-augmented generation benefit from measuring retrieval quality and generation quality separately [5].
Inspect AI, released by the UK AI Security Institute, follows a task/solver/scorer pattern designed for safety and capability evaluation. Tasks are composable Python objects; solvers define how the model approaches a task; scorers measure the result. It was built for reproducibility and auditability — each evaluation run produces a detailed log that can be archived as a compliance artefact. Several frontier-model evaluation programmes have adopted it for precisely that reason [6].
LangSmith provides managed evaluation alongside tracing and prompt versioning. Test datasets are maintained in the LangSmith UI; evaluation runs are triggered from the SDK. Its integration with the LangChain tracing ecosystem means that the same trace that captures a production failure can seed a regression test. The relevant constraint for platform engineers is deployment model: LangSmith is a SaaS product, and as of mid-2026 the self-hosted option is in limited preview. Teams operating in regulated or network-isolated environments need a self-hosted alternative [7].
Managed platforms (Patronus AI, Galileo) are a fourth category: they handle the infrastructure cost of running judges at scale and provide confidence-interval-aware scoring as a managed service. They are the right fit for teams where the operational overhead of self-hosted judge infrastructure is material, and where SaaS data residency constraints are acceptable.
A worked CI example
A developer opens a pull request that changes a prompt template. The CI pipeline must evaluate the changed template against the current production template and block the merge if the new version regresses. Here is the minimal pipeline structure:
eval:
stage: test
script:
# Run evaluations against the changed prompt config.
# The --threshold flag sets the minimum aggregate pass rate.
# The harness exits with code 100 if the rate falls below it.
- npx promptfoo eval --config eval/prompt-v2.yaml --threshold 0.85
# Regression gate: the new version must not score lower than
# production on the capability benchmark.
- python eval/compare_to_production.py \
--baseline eval/scores/production.json \
--candidate eval/scores/candidate.json \
--max-regression 0.05
artifacts:
paths:
- eval/scores/
expire_in: 90 days
rules:
- changes:
- prompts/**
- model-config/**The pass/fail rule is explicit and stated in the pipeline configuration itself, not in a shared spreadsheet: the aggregate pass rate must be at least 85%, and the new version must not regress more than 5% below the production baseline. Both numbers are reviewable in the pull request diff. The evaluation artefacts are archived — not only for debugging but as a compliance record if the model is subject to governance requirements.
Deterministic pass/fail from probabilistic scores
The structural challenge with LLM-as-judge is that scores are probabilistic. The same input, the same judge, and the same rubric can return 0.9 on one run and 0.7 on another. A CI gate that reads a single score and compares it to a threshold will produce flaky failures — the merge gate breaks on a particular run, not on a genuine capability regression.
The standard mitigation is to run each test case through the judge N times (typically 3–5) and gate on a statistic derived from the distribution rather than on a single sample. The conservative gate uses the lower bound of a one-standard-deviation interval:
import statistics
def evaluate_with_ci(judge_fn, test_case, n_runs=5, threshold=0.80):
"""
Run the judge N times. Gate on mean - 1 std >= threshold.
A model that scores 0.9 on four runs and 0.2 on one run
has mean=0.78 and std=0.31; lower_bound=0.47 — correctly fails.
"""
scores = [judge_fn(test_case) for _ in range(n_runs)]
mean = statistics.mean(scores)
std = statistics.stdev(scores) if len(scores) > 1 else 0.0
lower_bound = mean - std
return {
"mean": round(mean, 3),
"std": round(std, 3),
"lower_bound": round(lower_bound, 3),
"passed": lower_bound >= threshold,
"scores": scores,
}The key insight is that gating on the mean alone is insufficient. A judge that returns 0.9 on four runs and 0.2 on a fifth has a mean of 0.78 — which clears an 0.8 threshold by a narrow margin that obscures real variance. Gating on mean minus one standard deviation surfaces that instability and correctly fails the gate. The cost is additional inference: 5 runs instead of 1 means roughly 5× the judge inference cost for each test case, which should be accounted for in CI budget planning.
The confidence interval gap in OSS tooling
As of mid-2026, no widely adopted open-source LLM evaluation harness emits confidence intervals as a first-class output. Promptfoo reports a mean pass rate per test run; DeepEval reports metric scores per test case in a single pass. Neither tool provides a native N-runs-with-CI output that a CI gate can consume directly. The workaround described above — running the judge N times and computing the distribution in application code — is currently the standard pattern for teams that need this reliability guarantee. It is a gap in the tooling ecosystem, not a deliberate design choice, and it is worth watching: the managed platforms (Patronus, Galileo) have implemented CI-aware scoring, and the gap may close in OSS tooling as the pattern becomes more standard.
Pre-production gate structure
A production-grade LLM evaluation gate has four components that are distinct in purpose and should not be collapsed into a single pass-rate metric:
- Regression tests. Fixed input → expected output or expected behaviour. Any failure here is a hard blocker — the model can no longer perform a task it previously performed correctly. These should be deterministic where possible: exact-string or structured-output assertions that do not require a judge.
- Capability benchmark. The new version must score at or above the current production version's mean score on a defined task set. A tolerance of ±5% is common to absorb judge variance without masking real regressions. This gate prevents silent capability degradation.
- Safety evaluation. For user-facing prompts: an adversarial probe set covering jailbreak attempts, boundary-pushing inputs, and refusal behaviour. This is domain-specific and requires human-curated test cases — it cannot be generated automatically without defeating the purpose.
- Latency and cost budget. The new version must not exceed the production SLA. Token usage and time-to-first-token are measured against a representative traffic sample in the same CI run. A model change that improves capability scores but doubles token cost fails this gate.
Deployment context considerations
In regulated environments, the entire eval harness must run on infrastructure the organisation controls. The judge model must be self-hosted — a capable open-weights model served via a local inference server (e.g. vLLM, TGI, or a comparable inference runtime). SaaS eval platforms are excluded by data residency and network isolation requirements.
The evaluation artefacts — test cases, judge prompts, per-run scores, and pass/fail decisions — must be archived as compliance records. In AI governance frameworks such as the EU AI Act, evidence of pre-deployment testing is a required component of the conformity assessment for high-risk AI systems. Treating the CI eval artefacts as ephemeral is a governance risk. The CI pipeline should be configured to retain them for the duration required by the applicable regulatory framework.
References
- Sai, A. B., Mohankumar, A. K., and Khapra, M. M. "A Survey of Evaluation Metrics Used for NLG Systems." ACM Computing Surveys, 2022. arXiv:2008.12009.
- Liang, P., Bommasani, R., Lee, T., et al. "Holistic Evaluation of Language Models." arXiv:2211.09110, 2022.
- Zheng, L., Chiang, W.-L., Sheng, Y., et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2306.05685.
- Promptfoo. CI/CD Integration documentation. promptfoo.dev, 2024. (GitHub: github.com/promptfoo/promptfoo)
- DeepEval. Metrics documentation. confident-ai.com, 2024. (GitHub: github.com/confident-ai/deepeval)
- UK AI Security Institute. Inspect AI: A framework for large language model evaluations. aisi.gov.uk, 2024. (GitHub: github.com/UKGovernmentBEIS/inspect_ai)
- LangChain. LangSmith documentation. smith.langchain.com, 2024.
Tags
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
Thanks for reading! Explore more articles.
Back to Articles