Gemini 2.5 Pro and Grok 4 both claim strong multi-language code generation, but benchmarks from labs tell you nothing about your specific workload. A Python benchmarking harness that tests both models on identical prompts, executes the output, and scores correctness automatically gives you data that actually matters for your use case. This article builds that harness from scratch.
Analysis Briefing
- Topic: Gemini 2.5 Pro vs Grok 4 Code Generation Benchmarking
- Analyst: Mike D (@MrComputerScience)
- Context: A research sprint initiated by Grok 4.20
- Source: Pithy Cyborg | Pithy Security
- Key Question: Which model actually writes better code across Python, JavaScript, and Rust?
Why Lab Benchmarks Lie and Personal Harnesses Tell the Truth
HumanEval, MBPP, and SWE-bench are the standard code generation benchmarks cited in every model release post. They are useful for tracking progress across model generations. They are nearly useless for deciding which model to use in your specific coding workflow.
Lab benchmarks use fixed prompt sets, fixed evaluation criteria, and fixed language distributions. If your work is 70% Python data pipelines and 30% Rust systems code, a benchmark weighted toward JavaScript algorithms tells you nothing predictive. If you care about code that runs correctly in your environment with your dependencies, a benchmark that checks syntactic correctness rather than execution output is measuring the wrong thing.
The only benchmark that matters is one you run yourself, on prompts that represent your actual workload, evaluated against execution results rather than string matching. Benchmarks pitting a 4-bit quantized DeepSeek R1 against GPT-4o on Project Euler problems demonstrate exactly this: model rankings shift dramatically with problem domain, and the model that wins on mathematical reasoning does not necessarily win on practical code generation tasks.
Building your own harness takes two hours. The signal it produces is worth far more than any paper’s leaderboard position.
The Benchmarking Harness: Architecture and Core Code
The harness has four components: a prompt library, two model clients, an execution sandbox, and a scorer. Each prompt specifies a task, a target language, and an expected output or validation function.
```python
import json
import subprocess
import tempfile
import os

from openai import OpenAI
import google.generativeai as genai

# Configure clients
grok_client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1"
)

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-2.5-pro")

# Prompt library: each entry is a coding task
PROMPTS = [
    {
        "id": "py_fibonacci",
        "language": "python",
        "task": "Write a Python function called fibonacci(n) that returns the nth Fibonacci number using memoization. Include no imports.",
        "validator": lambda code, output: output.strip() == "55",
        "test_harness": "print(fibonacci(10))"
    },
    {
        "id": "js_palindrome",
        "language": "javascript",
        "task": "Write a JavaScript function called isPalindrome(s) that returns true if s is a palindrome ignoring case and spaces.",
        "validator": lambda code, output: "true" in output.lower() and "false" in output.lower(),
        "test_harness": "console.log(isPalindrome('A man a plan a canal Panama')); console.log(isPalindrome('hello'));"
    },
    {
        "id": "rust_sum",
        "language": "rust",
        "task": "Write a complete Rust main() function that computes the sum of squares of all integers from 1 to 100 and prints the result.",
        "validator": lambda code, output: "338350" in output,
        "test_harness": None  # Rust code includes its own main()
    },
]

def query_grok(task: str) -> str:
    response = grok_client.chat.completions.create(
        model="grok-4",
        messages=[
            {
                "role": "system",
                "content": "You are a code generation assistant. Return only raw code with no markdown fences, no explanations, no comments unless the task requires them."
            },
            {"role": "user", "content": task}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip()

def query_gemini(task: str) -> str:
    response = gemini_model.generate_content(
        f"You are a code generation assistant. Return only raw code with no markdown fences, no explanations.\n\n{task}",
        generation_config={"temperature": 0}
    )
    return response.text.strip()
```
Setting temperature=0 on both models is critical for reproducibility. A fair benchmark requires outputs that are as deterministic as the APIs allow; running at higher temperatures introduces variance that makes model comparisons meaningless across runs.
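Even at temperature=0, some hosted APIs do not guarantee bit-identical responses across calls. A quick sanity check (a sketch; `check_determinism` is an illustrative helper, and `query_fn` stands in for `query_grok` or `query_gemini`) repeats the same task and verifies the outputs match:

```python
def check_determinism(query_fn, task: str, runs: int = 3) -> bool:
    """Query the model `runs` times with an identical task and report
    whether every response is byte-for-byte identical."""
    outputs = [query_fn(task) for _ in range(runs)]
    return all(o == outputs[0] for o in outputs)
```

Running this once per model before a full benchmark tells you whether single-shot results are trustworthy; if it returns False, average pass rates over several runs instead of relying on one.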
Executing Generated Code and Scoring Results Automatically
The execution layer runs generated code in a subprocess with a timeout, captures stdout, and passes it to the validator function defined in each prompt.
```python
def execute_python(code: str, test_harness: str) -> tuple[str, bool]:
    full_code = code + "\n" + test_harness
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(full_code)
        fname = f.name
    try:
        result = subprocess.run(
            ["python3", fname],
            capture_output=True, text=True, timeout=10
        )
        return result.stdout + result.stderr, result.returncode == 0
    except subprocess.TimeoutExpired:
        return "TIMEOUT", False
    finally:
        os.unlink(fname)

def execute_rust(code: str) -> tuple[str, bool]:
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "main.rs")
        binary = os.path.join(tmpdir, "main")
        with open(src, "w") as f:
            f.write(code)
        compile_result = subprocess.run(
            ["rustc", src, "-o", binary],
            capture_output=True, text=True, timeout=30
        )
        if compile_result.returncode != 0:
            return compile_result.stderr, False
        run_result = subprocess.run(
            [binary], capture_output=True, text=True, timeout=10
        )
        return run_result.stdout, run_result.returncode == 0

def run_benchmark():
    results = []
    for prompt in PROMPTS:
        print(f"\nRunning: {prompt['id']}")
        grok_code = query_grok(prompt["task"])
        gemini_code = query_gemini(prompt["task"])
        for model_name, code in [("grok-4", grok_code), ("gemini-2.5-pro", gemini_code)]:
            if prompt["language"] == "python":
                output, ran = execute_python(code, prompt["test_harness"])
            elif prompt["language"] == "rust":
                output, ran = execute_rust(code)
            else:
                output, ran = "JS execution skipped", False
            passed = ran and prompt["validator"](code, output)
            results.append({
                "prompt_id": prompt["id"],
                "language": prompt["language"],
                "model": model_name,
                "output": output[:200],
                "passed": passed
            })
            print(f"  {model_name}: {'PASS' if passed else 'FAIL'}")

    # Summary
    print("\n=== RESULTS ===")
    for model in ["grok-4", "gemini-2.5-pro"]:
        model_results = [r for r in results if r["model"] == model]
        passed = sum(1 for r in model_results if r["passed"])
        print(f"{model}: {passed}/{len(model_results)} passed")
    return results

if __name__ == "__main__":
    run_benchmark()
```
The JavaScript execution path is marked as skipped in this harness because running arbitrary JS requires Node.js and introduces additional sandbox complexity. Add a Node.js subprocess handler following the same pattern as the Python executor to complete multi-language coverage.
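A minimal handler might look like the following sketch, which assumes `node` is on your PATH and returns a failure marker when it is not, mirroring the structure of `execute_python`:

```python
import os
import subprocess
import tempfile

def execute_javascript(code: str, test_harness: str) -> tuple[str, bool]:
    """Run generated JS plus its test harness under Node.js.
    Returns (combined stdout/stderr, ran-successfully flag)."""
    full_code = code + "\n" + test_harness
    with tempfile.NamedTemporaryFile(mode="w", suffix=".js", delete=False) as f:
        f.write(full_code)
        fname = f.name
    try:
        result = subprocess.run(
            ["node", fname],
            capture_output=True, text=True, timeout=10
        )
        return result.stdout + result.stderr, result.returncode == 0
    except subprocess.TimeoutExpired:
        return "TIMEOUT", False
    except FileNotFoundError:
        # Node.js is not installed or not on PATH
        return "NODE_NOT_FOUND", False
    finally:
        os.unlink(fname)
```

Wiring it in means replacing the `else` branch in `run_benchmark` with a call to `execute_javascript(code, prompt["test_harness"])`.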
What This Means For You
- Run benchmarks at temperature=0 for both models. Any other setting introduces non-determinism that makes your results unreproducible and your comparisons meaningless.
- Write validators that check execution output, not code structure. A model that produces syntactically different but functionally identical code should score the same as one that matches your expected pattern exactly.
- Add a rerun section to every benchmarking article you publish. Include the full prompt library as a downloadable file so readers can reproduce your results as models update. Rankings shift with every model release and a static screenshot is worthless in six months.
- Test on your actual workload, not synthetic toy problems. Pull three to five representative tasks from your real codebase, anonymize them, and add them to the prompt library. Those results will predict model usefulness far better than fibonacci and palindrome checks.
- Track model latency alongside correctness. A model that scores 95% correct but takes eight seconds per response may lose to a model scoring 88% correct in one second, depending on your use case. Log `time.perf_counter()` around each API call and include latency in your scoring output.
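The latency logging suggested above can be a small wrapper rather than edits to each query function. A sketch (`timed_call` is an illustrative name, not part of the harness above):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Inside `run_benchmark`, `grok_code, latency = timed_call(query_grok, prompt["task"])` captures timing per call, and the `latency` value can be stored in each results dict alongside `passed`.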
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
