Java 21 virtual threads eliminate the thread-per-request ceiling that made parallel LLM API calls expensive in traditional Java. Instead of blocking a platform thread for each Gemini API call, virtual threads park cheaply during I/O and let you run thousands of concurrent prompt requests with a fraction of the memory overhead. For batch inference workloads, this changes the economics of parallel LLM processing entirely.
Analysis Briefing
- Topic: Java 21 Virtual Threads for Parallel Gemini API Batching
- Analyst: Mike D (@MrComputerScience)
- Context: An adversarial analysis prompted by Gemini 2.5 Pro
- Source: Pithy Cyborg | Pithy Security
- Key Question: How do Java 21 virtual threads change the math on parallel Gemini API batching?
Why Traditional Java Threads Bottleneck LLM Batch Workloads
A Gemini API call takes anywhere from 500 milliseconds to several seconds depending on prompt length, model variant, and output token count. In a traditional Java thread model, each in-flight API call blocks a platform thread for the full duration of that wait. Platform threads are expensive: each carries a default stack size of 512KB to 1MB and maps to an OS thread. Running 500 concurrent Gemini calls the traditional way means 500 platform threads, consuming 250MB to 500MB of stack memory before a single byte of application data is allocated.
Thread pools cap this by limiting concurrency, but capping concurrency on I/O-bound LLM calls is exactly the wrong tradeoff. A fixed pool of 50 threads processing 10,000 prompts sequentially in batches of 50 takes 200 pool cycles. If each Gemini call averages 2 seconds, total wall time is 400 seconds. The threads are not computing during that time. They are waiting. You are paying for platform thread overhead to do nothing.
Virtual threads, introduced as a preview in Java 19 and finalized in Java 21, solve this by decoupling concurrency from OS thread count. A virtual thread parks on I/O without blocking its carrier platform thread. The JVM remounts the virtual thread onto a carrier when the I/O completes. You can run 10,000 concurrent virtual threads on a machine with 8 platform threads. Memory overhead per virtual thread starts at a few hundred bytes rather than hundreds of kilobytes.
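The parking behavior is easy to observe directly. A minimal sketch with no Gemini dependency, where a 100ms sleep stands in for API latency (class and method names here are illustrative, not from any SDK):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadDemo {

    // Launch `count` virtual threads that each park on simulated I/O,
    // then wait for all of them and return how many completed.
    static int runParked(int count) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // Thread.ofVirtual().start(...) creates and starts a virtual thread
            threads.add(Thread.ofVirtual().start(() -> {
                try {
                    Thread.sleep(Duration.ofMillis(100)); // parks; frees the carrier thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            }));
        }
        for (Thread t : threads) {
            t.join();
        }
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // All threads sleep concurrently, so total wall time stays close
        // to one sleep interval rather than count * 100ms.
        System.out.println(runParked(10_000));
    }
}
```

On a typical laptop this completes in a few hundred milliseconds, because every thread parks at the same time instead of queueing for a pool slot.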
For Gemini API batch workloads where each task is almost entirely I/O wait, virtual threads are the correct concurrency primitive. A related piece, "OpenAI rate limits breaking async Python," covers the Python side of the same problem: async frameworks exist precisely because blocking threads on LLM I/O is wasteful, and Java 21 finally gives the JVM a first-class answer that does not require rewriting everything in reactive style.
Implementing Parallel Gemini Batching With Virtual Threads
Java 21’s Executors.newVirtualThreadPerTaskExecutor() creates an executor that spawns a new virtual thread for every submitted task. Combined with ExecutorService.invokeAll() or structured concurrency via StructuredTaskScope, this gives you clean parallel batch execution with straightforward error handling.
Add the Gemini SDK dependency:
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-aiplatform</artifactId>
    <version>3.42.0</version>
</dependency>
Basic virtual thread batch executor:
import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.GenerateContentResponse;
import com.google.cloud.vertexai.generativeai.GenerativeModel;
import com.google.cloud.vertexai.generativeai.ResponseHandler;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class GeminiBatchProcessor {

    private final GenerativeModel model;

    public GeminiBatchProcessor(String projectId, String location) throws Exception {
        VertexAI vertexAI = new VertexAI(projectId, location);
        this.model = new GenerativeModel("gemini-2.5-pro", vertexAI);
    }

    public List<String> processBatch(List<String> prompts) throws InterruptedException {
        // Virtual thread executor: one virtual thread per task
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Callable<String>> tasks = prompts.stream()
                .map(prompt -> (Callable<String>) () -> queryGemini(prompt))
                .collect(Collectors.toList());

            List<Future<String>> futures = executor.invokeAll(tasks);

            return futures.stream().map(future -> {
                try {
                    return future.get();
                } catch (ExecutionException e) {
                    return "ERROR: " + e.getCause().getMessage();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return "INTERRUPTED";
                }
            }).collect(Collectors.toList());
        }
    }

    private String queryGemini(String prompt) {
        try {
            GenerateContentResponse response = model.generateContent(prompt);
            return ResponseHandler.getText(response);
        } catch (Exception e) {
            throw new RuntimeException("Gemini API call failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) throws Exception {
        GeminiBatchProcessor processor = new GeminiBatchProcessor(
            System.getenv("GCP_PROJECT_ID"),
            "us-central1"
        );

        List<String> prompts = List.of(
            "Summarize the key benefits of Java 21 virtual threads in one paragraph.",
            "Write a Python function to compute prime numbers up to 1000.",
            "Explain the difference between a mutex and a semaphore in three sentences.",
            "What is the time complexity of Dijkstra's algorithm and why?"
        );

        long start = System.currentTimeMillis();
        List<String> results = processor.processBatch(prompts);
        long elapsed = System.currentTimeMillis() - start;

        for (int i = 0; i < results.size(); i++) {
            System.out.printf("Prompt %d: %s%n%n", i + 1, results.get(i));
        }
        System.out.printf("Processed %d prompts in %dms%n", prompts.size(), elapsed);
    }
}
The try-with-resources block on the executor ensures clean shutdown after all tasks complete. Virtual thread executors implement AutoCloseable in Java 21, so close() waits for all submitted tasks to finish before releasing resources.
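The wall-time difference is easy to simulate without touching the API at all. The sketch below is not a rigorous benchmark; Thread.sleep stands in for Gemini latency, and the class name is illustrative. With 200 tasks of 50ms simulated I/O, a pool of 8 platform threads needs roughly 200 / 8 * 50ms, while the virtual thread executor finishes in roughly one latency interval:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.IntStream;

public class PoolComparison {

    // Run `tasks` simulated API calls (each a sleep of `ioMillis`) on the
    // given executor and return the observed wall time in milliseconds.
    static long timeBatch(ExecutorService executor, int tasks, long ioMillis)
            throws InterruptedException {
        List<Callable<Void>> work = IntStream.range(0, tasks)
            .<Callable<Void>>mapToObj(i -> () -> {
                Thread.sleep(ioMillis); // stand-in for network I/O
                return null;
            })
            .toList();

        long start = System.nanoTime();
        try (executor) {             // close() waits for all tasks (Java 21)
            executor.invokeAll(work);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long fixed = timeBatch(Executors.newFixedThreadPool(8), 200, 50);
        long virt  = timeBatch(Executors.newVirtualThreadPerTaskExecutor(), 200, 50);
        System.out.printf("fixed pool: %dms, virtual threads: %dms%n", fixed, virt);
    }
}
```

Swapping in real Gemini calls changes the absolute numbers but not the shape of the comparison: the fixed pool's wall time scales with batch size divided by pool size, while the virtual thread executor's scales with the slowest single call.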
Rate Limiting and Structured Concurrency for Production Batches
Submitting 10,000 virtual threads simultaneously against the Gemini API will trigger rate limit errors before the concurrency benefits materialize. Production batch processing requires a semaphore to cap in-flight requests at your quota ceiling, and Java 21's structured concurrency API to manage task lifecycles cleanly. Note that StructuredTaskScope is a preview API in Java 21 (JEP 453), so it requires compiling and running with the --enable-preview flag.
Rate-limited batch processor with structured concurrency:
import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.GenerateContentResponse;
import com.google.cloud.vertexai.generativeai.GenerativeModel;
import com.google.cloud.vertexai.generativeai.ResponseHandler;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.StructuredTaskScope;
import java.util.stream.Collectors;

public class RateLimitedGeminiBatchProcessor {

    private final GenerativeModel model;
    private final Semaphore rateLimiter;

    public RateLimitedGeminiBatchProcessor(
            String projectId,
            String location,
            int maxConcurrentRequests) throws Exception {
        VertexAI vertexAI = new VertexAI(projectId, location);
        this.model = new GenerativeModel("gemini-2.5-pro", vertexAI);
        this.rateLimiter = new Semaphore(maxConcurrentRequests);
    }

    public List<String> processBatch(List<String> prompts) throws Exception {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            List<StructuredTaskScope.Subtask<String>> subtasks = prompts.stream()
                .map(prompt -> scope.fork(() -> {
                    rateLimiter.acquire(); // blocks cheaply: parks the virtual thread
                    try {
                        return queryGemini(prompt);
                    } finally {
                        rateLimiter.release();
                    }
                }))
                .collect(Collectors.toList());

            scope.join().throwIfFailed();

            return subtasks.stream()
                .map(StructuredTaskScope.Subtask::get)
                .collect(Collectors.toList());
        }
    }

    // Same helper as in GeminiBatchProcessor above
    private String queryGemini(String prompt) {
        try {
            GenerateContentResponse response = model.generateContent(prompt);
            return ResponseHandler.getText(response);
        } catch (Exception e) {
            throw new RuntimeException("Gemini API call failed: " + e.getMessage(), e);
        }
    }
}
StructuredTaskScope.ShutdownOnFailure cancels all in-flight subtasks if any single task throws an exception. This prevents partial batch completion from silently producing incomplete results. If one Gemini call fails with a 503, all other pending calls are cancelled and the exception propagates cleanly to the caller, which can then decide whether to retry the full batch or the failed subset.
Set maxConcurrentRequests to 80% of your Gemini quota ceiling rather than the full ceiling. The buffer absorbs quota fluctuations from other services sharing the same GCP project credentials without triggering 429s.
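Even with the semaphore in place, occasional 429s will slip through during quota fluctuations. A common complement is to wrap each call in exponential backoff; since virtual threads make blocking cheap, sleeping between attempts costs almost nothing. The sketch below is illustrative: the class name is made up, and isRateLimited is a hypothetical placeholder, since the real check depends on the exception type your SDK version throws for HTTP 429:

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public class RetryingCaller {

    // Retry a call with exponential backoff on rate limit errors.
    // Thread.sleep here parks the virtual thread, not a platform thread.
    static <T> T withBackoff(Callable<T> call, int maxAttempts, Duration initialDelay)
            throws Exception {
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isRateLimited(e)) {
                    throw e; // non-retryable, or out of attempts
                }
                Thread.sleep(delay);
                delay = delay.multipliedBy(2); // double the wait each time
            }
        }
    }

    // Placeholder: in real code, inspect the SDK exception for an HTTP 429 status.
    static boolean isRateLimited(Exception e) {
        return e.getMessage() != null && e.getMessage().contains("429");
    }
}
```

Inside the structured concurrency version, each forked subtask would call withBackoff(() -> queryGemini(prompt), ...) instead of calling queryGemini directly.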
What This Means For You
- Switch from Executors.newFixedThreadPool() to Executors.newVirtualThreadPerTaskExecutor() for any Java service making parallel LLM API calls. The change is one line and the throughput improvement on I/O-bound workloads is immediate and measurable.
- Cap in-flight Gemini requests with a Semaphore set to 80% of your quota limit. Submitting unlimited concurrent requests defeats the rate limiter and produces cascading 429 errors that are slower to recover from than a properly throttled queue.
- Use StructuredTaskScope.ShutdownOnFailure for batch jobs where partial results are unacceptable. It guarantees all-or-nothing completion semantics with clean cancellation of in-flight tasks on first failure.
- Measure wall time on your batch workload before and after the virtual thread migration. For batches of 50 or more Gemini calls, expect 3x to 8x wall time reduction compared to a fixed platform thread pool of equivalent size.
- Do not use virtual threads for CPU-bound tasks in the same application. Virtual threads are optimized for I/O-bound concurrency. CPU-bound work should still use platform threads via ForkJoinPool to avoid starving the virtual thread carrier pool.
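The last point can be sketched as a routing pattern: I/O waits go to virtual threads, CPU-heavy post-processing goes to a bounded platform pool. This is an illustrative sketch under those assumptions, with a sleep standing in for the API call and a trivial length computation standing in for real CPU work:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class MixedWorkload {

    // I/O-bound work (API calls) runs on virtual threads; CPU-bound work
    // runs on a bounded ForkJoinPool so it cannot monopolize the platform
    // threads that carry the virtual threads.
    private final ExecutorService ioExecutor = Executors.newVirtualThreadPerTaskExecutor();
    private final ForkJoinPool cpuPool =
        new ForkJoinPool(Runtime.getRuntime().availableProcessors());

    CompletableFuture<Integer> fetchAndProcess(String prompt) {
        return CompletableFuture
            .supplyAsync(() -> simulateApiCall(prompt), ioExecutor)  // I/O on a virtual thread
            .thenApplyAsync(String::length, cpuPool);                // CPU work on platform threads
    }

    static String simulateApiCall(String prompt) {
        try {
            Thread.sleep(50); // stand-in for network latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response to: " + prompt;
    }
}
```

The design choice is the two-executor split itself: a single shared pool forces you to size for the worse of the two workload shapes, while the split lets I/O concurrency scale freely without letting CPU work starve the carriers.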
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
