Most developers skip evaluation entirely. They vibe-check their LLM app during development, ship it, and find out what is broken from user complaints. This is how you ship an agent that works great on your test cases and fails on 40% of production inputs.
Evaluation does not require a $500/month platform. Here is the full free eval setup I use before shipping any LLM feature.
Analysis Briefing
- Topic: Free LLM evaluation infrastructure for production apps
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | Pithy Security
- Key Question: How do you know if your LLM app is actually working before users tell you it isn’t?
Why Evals Are Not Optional
A language model is not deterministic. The same input does not always produce the same output. Your app’s quality is a distribution, not a value, and you cannot characterize a distribution by spot-checking three examples.
What you need to know before shipping:
- What percentage of inputs produce an output that meets your quality bar?
- Which input types produce the most failures?
- Does a prompt change that improves one case break another?
You cannot answer these questions without a systematic eval: run your app against a representative set of inputs and measure the outputs against defined quality criteria.
The Three Eval Types Every LLM App Needs
Type 1: Exact Match Evals
For tasks with deterministic correct answers. If your app extracts structured data from text, classifies inputs into categories, or produces outputs with clear right/wrong criteria, exact match evals apply.
```python
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_output: dict

def run_exact_match_eval(app_fn, cases: list[EvalCase]) -> dict:
    results = {"passed": 0, "failed": 0, "failures": []}
    for case in cases:
        actual = app_fn(case.input)
        if actual == case.expected_output:
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "input": case.input,
                "expected": case.expected_output,
                "actual": actual,
            })
    total = results["passed"] + results["failed"]
    results["pass_rate"] = results["passed"] / total if total > 0 else 0
    return results
```
Build your eval cases from real examples. Take 50 real inputs from production or from your expected use cases. Write the correct output for each. Run the eval. An 80% pass rate is your minimum bar before shipping; below that, fix your prompt before launching.
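As a concrete sketch of that loop, here is a self-contained run with a hypothetical `extract_invoice` standing in for your real LLM-backed app (the third case is a deliberate expected failure):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_output: dict

def extract_invoice(text: str) -> dict:
    # Hypothetical stand-in for your real LLM-backed extractor.
    amount = text.split("$")[-1].split()[0]
    return {"amount": amount}

cases = [
    EvalCase("Invoice total: $450 due March 1", {"amount": "450"}),
    EvalCase("Please remit $1200 by Friday", {"amount": "1200"}),
    EvalCase("No amount mentioned here", {"amount": None}),  # known hard case
]

passed = sum(1 for c in cases if extract_invoice(c.input) == c.expected_output)
pass_rate = passed / len(cases)
print(f"pass rate: {pass_rate:.0%}")
```

The failures list from `run_exact_match_eval` is the real payoff: it tells you exactly which inputs to iterate on.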
Type 2: LLM-as-Judge Evals
For tasks where quality is real but not binary. Summarization quality, response helpfulness, factual accuracy on open-ended questions, tone appropriateness. Exact match cannot capture these. LLM-as-judge can.
The key is a scoring rubric that the judge model applies consistently:
```python
import json

from groq import Groq

client = Groq()

JUDGE_PROMPT = """You are evaluating the quality of an AI assistant's response.

Task description: {task_description}
User input: {user_input}
AI response: {ai_response}

Score the response on a scale of 1-5 for each criterion:
1. Accuracy: Is the information correct and well-grounded?
2. Completeness: Does it address all parts of the user's question?
3. Clarity: Is it easy to understand?
4. Conciseness: Does it avoid unnecessary content?

Return ONLY a JSON object in this exact format:
{{"accuracy": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "conciseness": <1-5>, "reasoning": "<one sentence>"}}"""

def judge_response(task_description: str, user_input: str, ai_response: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task_description=task_description,
        user_input=user_input,
        ai_response=ai_response,
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    content = response.choices[0].message.content.strip()
    # Strip markdown code fences if present
    if content.startswith("```"):
        content = content.split("```")[1]
        if content.startswith("json"):
            content = content[4:]
    return json.loads(content)
```
Use Groq for the judge model. It is free, fast, and Llama 3.3 70B produces consistent scoring on well-specified rubrics.
Practical note: run each eval case through the judge twice with slight prompt variations and average the scores. Single-judge evals have high variance; two runs cost two API calls and produce much more stable scores.
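A sketch of that averaging wrapper (it assumes a `judge_fn` with the same keyword signature as `judge_response` above; introducing the slight per-run prompt variation is left inside the judge function):

```python
def stable_judge(judge_fn, task_description: str, user_input: str, ai_response: str) -> dict:
    # Call the judge twice and average each criterion to damp single-run variance.
    runs = [
        judge_fn(
            task_description=task_description,
            user_input=user_input,
            ai_response=ai_response,
        )
        for _ in range(2)
    ]
    criteria = ["accuracy", "completeness", "clarity", "conciseness"]
    return {c: sum(run[c] for run in runs) / len(runs) for c in criteria}

# Demo with a deterministic fake judge standing in for the real API call.
fake_scores = iter([
    {"accuracy": 4, "completeness": 5, "clarity": 4, "conciseness": 3},
    {"accuracy": 5, "completeness": 4, "clarity": 4, "conciseness": 4},
])
averaged = stable_judge(lambda **kw: next(fake_scores), "summarize", "ticket text", "summary")
print(averaged)
```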
Type 3: Regression Evals
For catching when a change breaks something that was working. Build a golden dataset: 30 to 50 input/output pairs that represent your app working correctly. Run it before and after every prompt change.
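The function below assumes a golden dataset file shaped roughly like this (the field names match what the code reads; the contents are purely illustrative):

```json
{
  "task_description": "Summarize a support ticket in two sentences",
  "cases": [
    {"input": "Ticket: user cannot reset their password after the 2FA change"},
    {"input": "Ticket: checkout page times out on mobile Safari"}
  ]
}
```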
```python
import datetime
import json
from pathlib import Path

def run_regression_eval(app_fn, golden_dataset_path: str, judge_fn) -> dict:
    with open(golden_dataset_path) as f:
        golden = json.load(f)
    results = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "cases": [],
        "summary": {},
    }
    scores = []
    for case in golden["cases"]:
        actual = app_fn(case["input"])
        score = judge_fn(
            task_description=golden["task_description"],
            user_input=case["input"],
            ai_response=actual,
        )
        avg_score = sum([
            score["accuracy"], score["completeness"],
            score["clarity"], score["conciseness"],
        ]) / 4
        scores.append(avg_score)
        results["cases"].append({
            "input": case["input"],
            "output": actual,
            "score": score,
            "avg_score": avg_score,
        })
    results["summary"] = {
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "cases_below_3": sum(1 for s in scores if s < 3),
    }
    # Save results for comparison against earlier runs
    output_path = Path(f"eval_results_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.json")
    output_path.write_text(json.dumps(results, indent=2))
    return results
```
Commit the golden dataset to your repository. Run the regression eval in CI as part of your test suite. A prompt change that drops the mean score by more than 0.3 points should not merge.
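The 0.3-point rule can be sketched as a comparison of two saved summaries (a minimal sketch: `before` and `after` stand for two eval_results JSON files loaded with `json.load`):

```python
def should_block_merge(baseline: dict, candidate: dict, max_drop: float = 0.3) -> bool:
    # Block the merge when the mean score regresses by more than max_drop points.
    drop = baseline["summary"]["mean_score"] - candidate["summary"]["mean_score"]
    return drop > max_drop

# Illustrative summaries in the shape run_regression_eval writes.
before = {"summary": {"mean_score": 4.1}}
after = {"summary": {"mean_score": 3.7}}
print(should_block_merge(before, after))
```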
Promptfoo: Free Eval Platform That Is Not a Toy
Promptfoo is the best free evaluation platform available. It is open source, runs locally, and does not require sending your data to any service. Install it:
```bash
npm install -g promptfoo
```
Configure an eval in promptfooconfig.yaml:
```yaml
prompts:
  - "Answer this question concisely: {{question}}"
  - "{{question}}\n\nAnswer directly and briefly."

providers:
  - id: groq:llama-3.3-70b-versatile
  - id: openai:gpt-4o-mini

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: llm-rubric
        value: "Response is one to two sentences maximum"
  - vars:
      question: "Explain how HTTPS works"
    assert:
      - type: llm-rubric
        value: "Explanation covers TLS handshake, certificates, and encrypted communication"
      - type: javascript
        value: "output.split(' ').length < 150" # under 150 words
```
Run with promptfoo eval. Promptfoo runs both prompt variants across all providers on all test cases, scores them against your assertions, and produces a comparison table. You see exactly which prompt performs better and on which cases.
This is how you A/B test prompt changes with data instead of vibes.
Building Your Eval Dataset
The most common mistake is writing eval cases that only represent happy-path inputs. Your eval set should include:
- Edge cases: empty inputs, very long inputs, inputs in unexpected languages, inputs with special characters
- Adversarial inputs: prompts that previous versions of your app handled badly, inputs designed to produce failures in the specific failure modes you have observed
- Representative distribution: roughly match the distribution of real inputs your app will receive, not just the easy ones
Start with 30 cases. Add 5 cases every time you find a new failure in production. Within three months of shipping you will have a 60 to 80 case eval set that characterizes your app’s behavior well.
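One way to make "add 5 cases per failure" a habit is a small helper that appends straight to the golden dataset file (a sketch; the file layout assumed here is the `cases` list the regression eval reads):

```python
import json
import os
import tempfile
from pathlib import Path

def add_failure_case(dataset_path: str, failed_input: str) -> int:
    # Append a production failure so the next regression run covers it.
    # Returns the new case count.
    path = Path(dataset_path)
    data = json.loads(path.read_text())
    data["cases"].append({"input": failed_input})
    path.write_text(json.dumps(data, indent=2))
    return len(data["cases"])

# Demo against a throwaway dataset file.
with tempfile.TemporaryDirectory() as tmp:
    p = os.path.join(tmp, "golden.json")
    Path(p).write_text(json.dumps({"task_description": "demo", "cases": [{"input": "a"}]}))
    n = add_failure_case(p, "input that broke the prompt in production")
print(n)
```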
The CI Integration
Add this to your GitHub Actions workflow:
```yaml
name: LLM Eval

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run regression eval
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: python run_eval.py --fail-below 3.5
```
The eval only runs on PRs that touch prompt or LLM code, so you are not burning free tier credits on unrelated changes. A PR that drops mean score below 3.5 fails CI and does not merge.
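run_eval.py is not a published tool, just whatever script wraps your regression eval; its --fail-below gate might be sketched like this (the hardcoded mean_score stands in for an actual eval run):

```python
import argparse

def gate(mean_score: float, threshold: float) -> int:
    # Exit code 1 fails the CI job when quality drops below the bar.
    return 0 if mean_score >= threshold else 1

def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--fail-below", type=float, default=3.5, dest="threshold")
    args = parser.parse_args(argv)
    # A real script would call run_regression_eval(...) here and read
    # results["summary"]["mean_score"]; hardcoded for illustration.
    mean_score = 3.8
    return gate(mean_score, args.threshold)

print(main([]))
```

Wiring the return value through `sys.exit` is what makes the GitHub Actions step pass or fail.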
This takes one afternoon to set up and catches regressions that would otherwise reach users.
Mike D builds in public at @MrComputerScience. All code in this post runs.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
