LLM application quality degrades silently. Prompt changes, model updates, retrieval drift, and data distribution shift all degrade outputs without raising an exception. The only way to know your app is getting worse is to run evals continuously, compare against a golden dataset, and treat quality as a metric you monitor like latency or error rate.
Analysis Briefing
- Topic: LLM application evaluation and quality regression detection
- Analyst: Mike D (@MrComputerScience)
- Context: Born from an exchange with Claude Sonnet 4.6 that refused to stay shallow
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: If your LLM app quietly gets worse after a model update, how would you know?
The Three Eval Types That Actually Catch Regressions
Exact match evals are the cheapest and most reliable. If your application extracts structured data from documents, transforms text in a predictable way, or classifies inputs into known categories, you can build a golden dataset of input/expected output pairs and check programmatically whether outputs match. No LLM required to run the evaluation.
LLM-as-judge evals handle outputs where exact match is too rigid. You send the model’s output to a separate evaluator model along with the original input and a rubric. The evaluator scores the response on dimensions like accuracy, groundedness, and completeness. This approach scales to open-ended tasks but introduces the evaluator’s own biases and costs additional API calls.
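The judge pattern can be sketched as below. The rubric wording, the three dimensions, and `gpt-4o` as the evaluator model are illustrative assumptions, not fixed choices; swap in whatever fits your task:

```python
import json

# Hypothetical rubric; tune the dimensions to your task.
JUDGE_RUBRIC = (
    "You are an evaluator. Score the RESPONSE to the INPUT from 1-5 on "
    "accuracy, groundedness, and completeness. Reply with JSON only, e.g. "
    '{"accuracy": 4, "groundedness": 5, "completeness": 3}'
)

def build_judge_messages(input_text: str, output_text: str) -> list[dict]:
    """Package the original input and the model's output for the evaluator model."""
    return [
        {"role": "system", "content": JUDGE_RUBRIC},
        {"role": "user", "content": f"INPUT:\n{input_text}\n\nRESPONSE:\n{output_text}"},
    ]

def judge(input_text: str, output_text: str, judge_model: str = "gpt-4o") -> dict:
    """Send the output to a separate evaluator model and parse its scores."""
    from openai import OpenAI  # imported here so the pure helper above needs no SDK

    client = OpenAI()
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        response_format={"type": "json_object"},  # force parseable JSON scores
        messages=build_judge_messages(input_text, output_text),
    )
    return json.loads(response.choices[0].message.content)
```

Pinning the judge to temperature 0 and JSON output keeps its scores comparable across runs, though it does not remove the evaluator model's own biases.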
Human evals remain the ground truth for anything customer-facing. The problem is cost and latency. The practical pattern is to use automated evals for continuous monitoring and trigger human review when automated scores drop below threshold.
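That trigger can be as simple as a score filter. A minimal sketch, where the 0.85 threshold and the `score` field are placeholders for whatever your automated eval emits:

```python
def flag_for_human_review(eval_results: list[dict], threshold: float = 0.85) -> list[dict]:
    """Return examples whose automated score fell below the threshold;
    these are routed to a human review queue rather than auto-accepted."""
    return [r for r in eval_results if r["score"] < threshold]
```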
Building a Golden Dataset That Stays Useful
A golden dataset is a collection of inputs with verified expected outputs. It needs three properties to stay useful over time.
First, it must cover failure modes, not just happy paths. If your application fails on negation queries, ambiguous references, or edge cases in your domain, those cases must be in the dataset. A golden dataset built only from successful past outputs will not catch the regressions that matter.
Second, it needs to be versioned alongside your prompts and model configuration. An eval run against golden dataset v1 with prompt v3 and model GPT-4o is a different measurement than the same dataset with a different prompt or model. Treat dataset version, prompt hash, and model name as required metadata on every eval run.
Third, add to it continuously from production failures. Every time a user reports a bad output or your monitoring flags an anomaly, the input goes into the golden dataset after manual verification. The dataset should grow every week.
A minimal exact-match eval runner:

```python
from openai import OpenAI

client = OpenAI()

def run_eval(golden_dataset: list[dict], model: str, prompt_template: str) -> dict:
    """Run every golden example and compare output to the expected string verbatim."""
    results = []
    for example in golden_dataset:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt_template},
                {"role": "user", "content": example["input"]},
            ],
            temperature=0,  # remove sampling variance from the regression signal
        )
        output = response.choices[0].message.content
        passed = output.strip() == example["expected"].strip()
        results.append({
            "input": example["input"],
            "expected": example["expected"],
            "actual": output,
            "passed": passed,
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "results": results, "model": model}
```
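To make runs like this comparable over time, stamp each one with the metadata the previous section calls for: dataset version, prompt hash, and model name. A sketch (the field names and the 12-character hash truncation are arbitrary conventions):

```python
import hashlib
from datetime import datetime, timezone

def eval_run_metadata(prompt_template: str, model: str, dataset_version: str) -> dict:
    """Metadata to attach to every eval run so regressions can be bisected later."""
    return {
        "model": model,
        # Hash the prompt so any edit to it is visible in the run history.
        "prompt_hash": hashlib.sha256(prompt_template.encode()).hexdigest()[:12],
        "dataset_version": dataset_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Two runs with identical metadata are directly comparable; any differing field tells you which change (prompt, model, or dataset) to investigate first.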
Wiring Evals Into CI So Regressions Block Deploys
An eval that runs manually gets run rarely. An eval wired into CI runs on every pull request and blocks deployment when pass rate drops below threshold.
The pattern is straightforward. Store your golden dataset in the repository. Add an eval job to your CI pipeline that runs on every PR touching prompts or application logic. Set a minimum pass rate threshold. If the PR drops pass rate from 94% to 87%, the deploy fails and the diff shows exactly which examples regressed.
```yaml
# .github/workflows/evals.yml
name: LLM Evals
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        # Assumes dependencies are pinned in requirements.txt
        run: pip install -r requirements.txt
      - name: Run evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python run_evals.py \
            --dataset golden_dataset.json \
            --min-pass-rate 0.90 \
            --fail-on-regression
```
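A fail-on-regression check needs to compare the candidate run against a stored baseline. Assuming the result shape returned by `run_eval` above, the comparison reduces to a set difference:

```python
def regressed_examples(baseline: dict, candidate: dict) -> list[str]:
    """Inputs that passed in the baseline eval run but fail in the candidate run,
    i.e. the exact examples a PR regressed."""
    passed_before = {r["input"] for r in baseline["results"] if r["passed"]}
    return [
        r["input"]
        for r in candidate["results"]
        if r["input"] in passed_before and not r["passed"]
    ]
```

Surfacing this list in the CI failure message is what makes the diff actionable: the author sees which examples broke, not just that a percentage moved.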
The threshold needs judgment. A 90% minimum pass rate on a 50-example dataset is very different from 90% on a 500-example dataset. Start with whatever your current pass rate is and treat any regression as a signal, not a hard block, until you have enough history to set a meaningful threshold.
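One way to quantify how dataset size changes what a pass rate means is a Wilson score interval on the observed rate; a sketch, using the standard formula with a 95% z-value:

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate: how much of an apparent
    drop could be noise given the dataset size."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)
```

At 45/50 the interval spans roughly 0.79 to 0.96, so a few points of movement is indistinguishable from noise; at 450/500 it tightens to roughly 0.87 to 0.92, and the same drop becomes a real signal.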
What This Means For You
- Build your golden dataset before you build your eval pipeline, because the tooling is useless without representative examples that include your actual failure modes.
- Run evals at temperature 0 to eliminate sampling variance from your regression signal, because you need to know whether the prompt changed the output, not whether the model rolled a different sample.
- Store every eval run with its metadata (model name, prompt hash, dataset version, timestamp) so you can bisect regressions to the exact change that caused them.
- Add new failures to the golden dataset immediately, because every production failure your users experience is a test case you did not have.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
