Gemini 2.5 Pro advertises a 1 million token context window. I tested it against a real 100k token Python codebase to find out where the comprehension holds and where it fails. The answer is more nuanced than the marketing suggests.
Analysis Briefing
- Topic: Gemini 2.5 Pro long-context codebase comprehension benchmark
- Analyst: Mike D (@MrComputerScience)
- Context: A research sprint prompted by the release of Gemini 2.5 Pro
- Source: Pithy Cyborg | Pithy Security
- Key Question: Does a million-token context window actually work on a real codebase, or does it quietly lose the middle?
The Test Setup
I used a real open-source Python project: a medium-sized web application with a Django backend, async task processing, and a React frontend. The Python backend alone tokenizes to approximately 103,000 tokens across 47 files.
I loaded the entire Python codebase into a single Gemini 2.5 Pro context window using the API and ran six categories of queries, from simple fact retrieval to complex cross-file reasoning.
No chunking. No RAG. The full codebase in one shot.
Test 1: Simple Fact Retrieval
Query: What database does this application use and where is the connection configured?
Result: Correct. Gemini identified PostgreSQL, found the DATABASES configuration in settings.py, noted the environment variable override pattern in settings_production.py, and correctly identified that the connection pool is configured in a custom database router.
Assessment: Long-context retrieval for single, well-defined facts works well. This is the easiest case and Gemini handles it cleanly.
Test 2: Function Tracing
Query: Trace the full execution path when a user submits an order. Start from the API endpoint and follow every function call until the order is persisted to the database.
Result: Mostly correct with one missed branch. Gemini correctly traced the main path through the API view, serializer, service layer, and model save. It missed a branch in the service layer that triggers a separate inventory check through a different service class, defined in a file the model appeared to weight less heavily.
The missed branch was in a file near the middle of the context. The files at the beginning (models, settings) and end (the main views file) were represented more accurately than files in the middle.
Assessment: This is the U-shaped attention curve in action at 100k tokens. Core architecture files sit early in the context, the files you are actively asking about sit late, and files in the middle of a large context receive genuinely less reliable attention.
Test 3: Cross-File Dependency Analysis
Query: List every file that imports from utils/email.py and describe what each import uses from that module.
Result: Found 8 of 11 actual importers. Missed three files that were loaded contiguously in the middle of the context window. For the 8 it found, the descriptions of what each imported were accurate.
Assessment: Consistent with Test 2. Mid-context files are the reliability weak point. If you need complete cross-file analysis, expect to miss 10-30% of references in a 100k context.
Test 4: Bug Finding
Query: Look for potential N+1 query problems in the view layer.
Result: Found three actual N+1 problems correctly and described each one clearly. Did not hallucinate any N+1 problems that were not there. This was the cleanest result of the six tests.
Assessment: Pattern recognition within individual files works reliably even in a long context. The model is finding a specific code pattern, which relies less on cross-file reasoning and more on recognizing a local code smell.
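For readers unfamiliar with the pattern, this is the N+1 shape the model was hunting for. A toy sketch with a query log standing in for a real ORM; all names here are made up:

```python
AUTHORS = {1: "Ada", 2: "Grace", 3: "Alan"}
POSTS = [{"id": i, "author_id": i % 3 + 1} for i in range(9)]
queries = []

def get_author(author_id):
    queries.append(f"SELECT ... WHERE id={author_id}")   # one round trip per call
    return AUTHORS[author_id]

def get_authors(author_ids):
    queries.append(f"SELECT ... WHERE id IN {sorted(author_ids)}")  # one batched trip
    return {i: AUTHORS[i] for i in author_ids}

# N+1: a query inside the loop, one per post.
queries.clear()
slow = [get_author(p["author_id"]) for p in POSTS]
slow_queries = len(queries)          # 9 round trips

# Fixed: one batched lookup, the shape select_related/prefetch_related gives you.
queries.clear()
authors = get_authors({p["author_id"] for p in POSTS})
fast = [authors[p["author_id"]] for p in POSTS]
fast_queries = len(queries)          # 1 round trip
```

Because the smell is local to a single view function, the model can spot it without reasoning across files, which is why this test came back clean.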
Test 5: Architecture Summary
Query: Describe the overall architecture of this application. What are the main components, how do they communicate, and what external services does it depend on?
Result: Accurate high-level description. Correctly identified the Django REST API, Celery task queue, Redis cache, PostgreSQL database, S3 file storage, and SendGrid email integration. The description of how components communicate was accurate and complete.
Assessment: High-level architecture questions that draw on patterns visible throughout the codebase work well. The model synthesizes the common patterns it sees across many files effectively.
Test 6: Specific Change Request
Query: The UserSerializer in serializers/user.py needs to include the user’s subscription status. Describe exactly what changes to make and which other files might be affected.
Result: Correctly identified the serializer changes needed and found the subscription model relationship. Missed one view that would need updating because it explicitly excludes the subscription field via fields = ['id', 'email', 'username']. That fields list was in a different file from the serializer and appears to have been in a low-attention region of the context.
Assessment: Cross-file impact analysis is unreliable for the 10-30% of files in the middle of a large context. Use it to find most of the affected files, then verify manually.
The Practical Numbers
| Task Type | Reliability at 100k Tokens |
|---|---|
| Single fact retrieval | High (~95%) |
| Pattern recognition within files | High (~90%) |
| High-level architecture synthesis | High (~85%) |
| Cross-file tracing (main path) | Medium (~75%) |
| Complete cross-file dependency analysis | Medium (~70%) |
| Complete impact analysis for changes | Medium (~65%) |
These are rough estimates from a single codebase. Your numbers will vary based on which files land where in the context and how cross-cutting the analysis needs to be.
What Actually Helps
Load the files you care about last. The files you are actively asking about should be at the end of the context, where attention is strongest. Load configuration files, models, and shared utilities first. Load the specific files relevant to your query last.
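That load-order heuristic can be sketched as a priority sort. The keyword buckets here are my own guess at a reasonable default, not a tested recipe:

```python
def order_for_context(paths, focus):
    """Put foundation files first and the files under discussion last,
    where attention is strongest."""
    def priority(p):
        if p in focus:
            return 2    # end of context: the files your query is about
        if any(k in p for k in ("settings", "models", "utils")):
            return 0    # foundations first
        return 1        # everything else lands in the middle
    return sorted(paths, key=priority)
```

Anything that has to be complete should not be left to the middle bucket; put it in `focus` or verify it separately.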
Ask explicitly about what you might have missed. “Are there any other files that might be affected that you did not mention?” produces useful additional results about half the time. The model sometimes knows it was uncertain and will flag it when asked.
Use targeted queries, not open-ended ones. “List every import of utils/email.py” is more reliable than “find all the email-related code.” Specific questions with verifiable answers produce more reliable results than open-ended exploration.
Verify completeness on critical analyses. If you are using this for a security audit, dependency analysis before a refactor, or any task where a missed file has real consequences, verify the results with a code search tool rather than trusting long-context completeness.
Is 1M Context Worth Using?
For codebase understanding tasks, Gemini 2.5 Pro at 100k tokens is genuinely useful. It is not a complete replacement for a code search tool or a human reviewer who has worked in the codebase: it is the difference between asking questions of a developer who has read your codebase once and one who has worked in it for six months.
For tasks that do not require complete coverage, the ~95% reliability at simple retrieval and ~85% at architecture synthesis are genuinely useful. For tasks that require complete coverage, chunk your codebase into focused sections and query each section directly rather than sending everything and hoping the middle does not get lost.
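A rough chunker for that focused-sections approach, using the common ~4 characters per token estimate rather than a real tokenizer:

```python
def chunk_files(files, budget_tokens=30_000):
    """files: list of (path, text) pairs. Yields lists of paths whose
    combined estimated token count fits within budget_tokens."""
    chunk, used = [], 0
    for path, text in files:
        cost = max(1, len(text) // 4)      # crude ~4 chars/token estimate
        if chunk and used + cost > budget_tokens:
            yield chunk
            chunk, used = [], 0
        chunk.append(path)
        used += cost
    if chunk:
        yield chunk
```

Grouping files by package or feature before chunking, rather than alphabetically, keeps each query's context coherent.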
The million token context window is real. The attention reliability across that entire window is not uniform. Use it accordingly.
Mike D tests AI tools on real code. Follow at @MrComputerScience.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
