Most developers skip evaluation entirely. They vibe-check their LLM app during development, ship it, and find out what is broken from user complaints. This is how you ship an agent that works great on your test cases and fails on 40% of production inputs.
Evaluation does not require a $500/month platform. Here is the full free eval setup I use before shipping any LLM feature.
Analysis Briefing
- Topic: Free LLM evaluation infrastructure for production apps
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | Pithy Security
- Key Question: How do you know if your LLM app is actually working before users tell you it isn’t?
Why Evals Are Not Optional
A language model is not deterministic. The same input does not always produce the same output. Your app’s quality is a distribution, not a value, and you cannot characterize a distribution by spot-checking three examples.
What you need to know before shipping:
- What percentage of inputs produce an output that meets your quality bar?
- Which input types produce the most failures?
- Does a prompt change that improves one case break another?
You cannot answer these questions without a systematic eval: run your app against a representative set of inputs and measure the outputs against defined quality criteria.
The Three Eval Types Every LLM App Needs
Type 1: Exact Match Evals
For tasks with deterministic correct answers. If your app extracts structured data from text, classifies inputs into categories, or produces outputs with clear right/wrong criteria, exact match evals apply.
```python
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_output: dict

def run_exact_match_eval(app_fn, cases: list[EvalCase]) -> dict:
    results = {"passed": 0, "failed": 0, "failures": []}
    for case in cases:
        actual = app_fn(case.input)
        if actual == case.expected_output:
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "input": case.input,
                "expected": case.expected_output,
                "actual": actual,
            })
    total = results["passed"] + results["failed"]
    results["pass_rate"] = results["passed"] / total if total > 0 else 0
    return results
```
Build your eval cases from real examples. Take 50 real inputs from production or from your expected use cases. Write the correct output for each. Run the eval. An 80% pass rate is your minimum bar before shipping; below that, fix your prompt before launching.
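As a concrete sketch of that loop, here is a self-contained run with a hypothetical `extract_invoice` standing in for your real LLM-backed app (the third case is a deliberate expected failure):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_output: dict

def extract_invoice(text: str) -> dict:
    # Hypothetical stand-in for your real LLM-backed extractor.
    amount = text.split("$")[-1].split()[0]
    return {"amount": amount}

cases = [
    EvalCase("Invoice total: $450 due March 1", {"amount": "450"}),
    EvalCase("Please remit $1200 by Friday", {"amount": "1200"}),
    EvalCase("No amount mentioned here", {"amount": None}),  # known hard case
]

passed = sum(1 for c in cases if extract_invoice(c.input) == c.expected_output)
pass_rate = passed / len(cases)
print(f"pass rate: {pass_rate:.0%}")
```

The failures list from `run_exact_match_eval` is the real payoff: it tells you exactly which inputs to iterate on.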
Type 2: LLM-as-Judge Evals
For tasks where quality is real but not binary. Summarization quality, response helpfulness, factual accuracy on open-ended questions, tone appropriateness. Exact match cannot capture these. LLM-as-judge can.
The key is a scoring rubric that the judge model applies consistently:
```python
import json

from groq import Groq

client = Groq()

JUDGE_PROMPT = """You are evaluating the quality of an AI assistant's response.

Task description: {task_description}
User input: {user_input}
AI response: {ai_response}

Score the response on a scale of 1-5 for each criterion:
1. Accuracy: Is the information correct and well-grounded?
2. Completeness: Does it address all parts of the user's question?
3. Clarity: Is it easy to understand?
4. Conciseness: Does it avoid unnecessary content?

Return ONLY a JSON object in this exact format:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "conciseness": <1-5>, "reasoning": "<one sentence>"}}"""

def judge_response(task_description: str, user_input: str, ai_response: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task_description=task_description,
        user_input=user_input,
        ai_response=ai_response,
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    content = response.choices[0].message.content.strip()
    # Strip markdown code fences if present
    if content.startswith("```"):
        content = content.split("```")[1]
        if content.startswith("json"):
            content = content[4:]
    return json.loads(content)
```
Use Groq for the judge model. It is free, fast, and Llama 3.3 70B produces consistent scoring on well-specified rubrics.
Practical note: run each eval case through the judge twice with slight prompt variations and average the scores. Single-judge evals have high variance; two runs cost two API calls and produce much more stable scores.
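A sketch of that averaging wrapper (it assumes a `judge_fn` with the same keyword signature as `judge_response` above; introducing the slight per-run prompt variation is left inside the judge function):

```python
def stable_judge(judge_fn, task_description: str, user_input: str, ai_response: str) -> dict:
    # Call the judge twice and average each criterion to damp single-run variance.
    runs = [
        judge_fn(
            task_description=task_description,
            user_input=user_input,
            ai_response=ai_response,
        )
        for _ in range(2)
    ]
    criteria = ["accuracy", "completeness", "clarity", "conciseness"]
    return {c: sum(run[c] for run in runs) / len(runs) for c in criteria}

# Demo with a deterministic fake judge standing in for the real API call.
fake_scores = iter([
    {"accuracy": 4, "completeness": 5, "clarity": 4, "conciseness": 3},
    {"accuracy": 5, "completeness": 4, "clarity": 4, "conciseness": 4},
])
averaged = stable_judge(lambda **kw: next(fake_scores), "summarize", "ticket text", "summary")
print(averaged)
```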
Type 3: Regression Evals
For catching when a change breaks something that was working. Build a golden dataset: 30 to 50 input/output pairs that represent your app working correctly. Run it before and after every prompt change.
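The function below assumes a golden dataset file shaped roughly like this (the field names match what the code reads; the contents are purely illustrative):

```json
{
  "task_description": "Summarize a support ticket in two sentences",
  "cases": [
    {"input": "Ticket: user cannot reset their password after the 2FA change"},
    {"input": "Ticket: checkout page times out on mobile Safari"}
  ]
}
```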
```python
import datetime
import json
from pathlib import Path

def run_regression_eval(app_fn, golden_dataset_path: str, judge_fn) -> dict:
    with open(golden_dataset_path) as f:
        golden = json.load(f)
    results = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "cases": [],
        "summary": {},
    }
    scores = []
    for case in golden["cases"]:
        actual = app_fn(case["input"])
        score = judge_fn(
            task_description=golden["task_description"],
            user_input=case["input"],
            ai_response=actual,
        )
        avg_score = sum([
            score["accuracy"], score["completeness"],
            score["clarity"], score["conciseness"],
        ]) / 4
        scores.append(avg_score)
        results["cases"].append({
            "input": case["input"],
            "output": actual,
            "score": score,
            "avg_score": avg_score,
        })
    results["summary"] = {
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "cases_below_3": sum(1 for s in scores if s < 3),
    }
    # Save results for comparison against earlier runs
    output_path = Path(f"eval_results_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.json")
    output_path.write_text(json.dumps(results, indent=2))
    return results
```
Commit the golden dataset to your repository. Run the regression eval in CI as part of your test suite. A prompt change that drops the mean score by more than 0.3 points should not merge.
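The 0.3-point rule can be sketched as a comparison of two saved summaries (a minimal sketch: `before` and `after` stand for two eval_results JSON files loaded with `json.load`):

```python
def should_block_merge(baseline: dict, candidate: dict, max_drop: float = 0.3) -> bool:
    # Block the merge when the mean score regresses by more than max_drop points.
    drop = baseline["summary"]["mean_score"] - candidate["summary"]["mean_score"]
    return drop > max_drop

# Illustrative summaries in the shape run_regression_eval writes.
before = {"summary": {"mean_score": 4.1}}
after = {"summary": {"mean_score": 3.7}}
print(should_block_merge(before, after))
```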
Promptfoo: Free Eval Platform That Is Not a Toy
Promptfoo is the best free evaluation platform available. It is open source, runs locally, and does not require sending your data to any service. Install it:
```bash
npm install -g promptfoo
```
Configure an eval in promptfooconfig.yaml:
```yaml
prompts:
  - "Answer this question concisely: {{question}}"
  - "{{question}}\n\nAnswer directly and briefly."

providers:
  - id: groq:llama-3.3-70b-versatile
  - id: openai:gpt-4o-mini

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: llm-rubric
        value: "Response is one to two sentences maximum"
  - vars:
      question: "Explain how HTTPS works"
    assert:
      - type: llm-rubric
        value: "Explanation covers TLS handshake, certificates, and encrypted communication"
      - type: javascript
        value: "output.split(' ').length < 150" # under 150 words
```
Run with promptfoo eval. Promptfoo runs both prompt variants across all providers on all test cases, scores them against your assertions, and produces a comparison table. You see exactly which prompt performs better and on which cases.
This is how you A/B test prompt changes with data instead of vibes.
Building Your Eval Dataset
The most common mistake is writing eval cases that only represent happy-path inputs. Your eval set should include:
- Edge cases: empty inputs, very long inputs, inputs in unexpected languages, inputs with special characters
- Adversarial inputs: prompts that previous versions of your app handled badly, inputs designed to produce failures in the specific failure modes you have observed
- Representative distribution: roughly match the distribution of real inputs your app will receive, not just the easy ones
Start with 30 cases. Add 5 cases every time you find a new failure in production. Within three months of shipping you will have a 60 to 80 case eval set that characterizes your app’s behavior well.
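One way to make "add 5 cases per failure" a habit is a small helper that appends straight to the golden dataset file (a sketch; the file layout assumed here is the `cases` list the regression eval reads):

```python
import json
import os
import tempfile
from pathlib import Path

def add_failure_case(dataset_path: str, failed_input: str) -> int:
    # Append a production failure so the next regression run covers it.
    # Returns the new case count.
    path = Path(dataset_path)
    data = json.loads(path.read_text())
    data["cases"].append({"input": failed_input})
    path.write_text(json.dumps(data, indent=2))
    return len(data["cases"])

# Demo against a throwaway dataset file.
with tempfile.TemporaryDirectory() as tmp:
    p = os.path.join(tmp, "golden.json")
    Path(p).write_text(json.dumps({"task_description": "demo", "cases": [{"input": "a"}]}))
    n = add_failure_case(p, "input that broke the prompt in production")
print(n)
```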
The CI Integration
Add this to your GitHub Actions workflow:
```yaml
name: LLM Eval

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run regression eval
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: python run_eval.py --fail-below 3.5
```
The eval only runs on PRs that touch prompt or LLM code, so you are not burning free tier credits on unrelated changes. A PR that drops mean score below 3.5 fails CI and does not merge.
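run_eval.py is not a published tool, just whatever script wraps your regression eval; its --fail-below gate might be sketched like this (the hardcoded mean_score stands in for an actual eval run):

```python
import argparse

def gate(mean_score: float, threshold: float) -> int:
    # Exit code 1 fails the CI job when quality drops below the bar.
    return 0 if mean_score >= threshold else 1

def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--fail-below", type=float, default=3.5, dest="threshold")
    args = parser.parse_args(argv)
    # A real script would call run_regression_eval(...) here and read
    # results["summary"]["mean_score"]; hardcoded for illustration.
    mean_score = 3.8
    return gate(mean_score, args.threshold)

print(main([]))
```

Wiring the return value through `sys.exit` is what makes the GitHub Actions step pass or fail.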
This takes one afternoon to set up and catches regressions that would otherwise reach users.
Mike D builds in public at @MrComputerScience. All code in this post runs.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
