A single-provider LLM integration fails the moment that provider has an outage. A resilient gateway routes to fallback models automatically, applies per-provider timeout configuration, and surfaces provider health without requiring a restart. This post builds one in Spring Boot using the patterns that hold up in production.
Analysis Briefing
- Topic: Multi-provider LLM gateway with fallback routing in Spring Boot
- Analyst: Mike D (@MrComputerScience)
- Context: A technical briefing developed with Claude Sonnet 4.6
- Source: Pithy Cyborg | Pithy Security
- Key Question: When your primary LLM provider goes down, what keeps your Java application running?
The Provider Abstraction Layer
Every provider gets the same interface. The gateway does not know which provider it is calling. It knows the interface, the timeout, and the fallback order.
```java
public interface LlmProvider {
    String name();
    CompletableFuture<String> complete(String prompt, int maxTokens);
    boolean isHealthy();
}
```
Implement it for each provider with explicit timeout configuration:
```java
@Component
public class OpenAiProvider implements LlmProvider {

    private final OpenAIClient client;
    private final Duration timeout;

    public OpenAiProvider(
            @Value("${llm.openai.api-key}") String apiKey,
            @Value("${llm.openai.timeout-ms:10000}") long timeoutMs) {
        this.client = OpenAIOkHttpClient.builder()
                .apiKey(apiKey)
                .build();
        this.timeout = Duration.ofMillis(timeoutMs);
    }

    @Override
    public String name() { return "openai"; }

    @Override
    public CompletableFuture<String> complete(String prompt, int maxTokens) {
        return CompletableFuture.supplyAsync(() -> {
            var response = client.chat().completions().create(
                    ChatCompletionCreateParams.builder()
                            .model(ChatModel.GPT_4O_MINI)
                            .addUserMessage(prompt)
                            .maxTokens(maxTokens)
                            .build()
            );
            return response.choices().get(0).message().content().orElseThrow();
        }).orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public boolean isHealthy() {
        // Implement a lightweight health probe here
        return true;
    }
}
```
Same pattern for Anthropic, Groq, and any other provider you want in the rotation. The gateway only ever sees LlmProvider.
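Because the gateway depends only on the interface, a canned stub can stand in for a real provider in unit tests or local development. A minimal sketch, with the interface restated so the snippet compiles on its own (the StubProvider name and reply are assumptions, not from the post):

```java
import java.util.concurrent.CompletableFuture;

// Mirrors the article's LlmProvider interface so this sketch is self-contained.
interface LlmProvider {
    String name();
    CompletableFuture<String> complete(String prompt, int maxTokens);
    boolean isHealthy();
}

// A canned-response provider: useful in tests, or as a degraded last-resort fallback.
class StubProvider implements LlmProvider {
    private final String reply;

    StubProvider(String reply) { this.reply = reply; }

    @Override
    public String name() { return "stub"; }

    @Override
    public CompletableFuture<String> complete(String prompt, int maxTokens) {
        // No network call: completes immediately with the canned reply.
        return CompletableFuture.completedFuture(reply);
    }

    @Override
    public boolean isHealthy() { return true; }
}
```

Swapping a stub in is also a quick way to verify the fallback order end to end without burning API credits.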
The Gateway With Ordered Fallback
```java
@Service
public class LlmGateway {

    private static final Logger log = LoggerFactory.getLogger(LlmGateway.class);

    private final List<LlmProvider> providers;
    private final MeterRegistry meterRegistry;

    public LlmGateway(List<LlmProvider> providers, MeterRegistry meterRegistry) {
        // Spring injects all LlmProvider beans in @Order sequence
        this.providers = providers;
        this.meterRegistry = meterRegistry;
    }

    public String complete(String prompt, int maxTokens) {
        List<Exception> failures = new ArrayList<>();
        for (LlmProvider provider : providers) {
            if (!provider.isHealthy()) {
                log.warn("Skipping unhealthy provider: {}", provider.name());
                continue;
            }
            try {
                long start = System.currentTimeMillis();
                String result = provider.complete(prompt, maxTokens)
                        .get(15, TimeUnit.SECONDS);
                long latency = System.currentTimeMillis() - start;
                meterRegistry.counter("llm.requests",
                        "provider", provider.name(), "status", "success").increment();
                meterRegistry.timer("llm.latency",
                        "provider", provider.name()).record(latency, TimeUnit.MILLISECONDS);
                return result;
            } catch (TimeoutException e) {
                // Catches the gateway's outer 15-second cap. A per-provider
                // orTimeout surfaces here as an ExecutionException instead.
                log.warn("Provider {} timed out", provider.name());
                meterRegistry.counter("llm.requests",
                        "provider", provider.name(), "status", "timeout").increment();
                failures.add(e);
            } catch (Exception e) {
                log.warn("Provider {} failed: {}", provider.name(), e.getMessage());
                meterRegistry.counter("llm.requests",
                        "provider", provider.name(), "status", "error").increment();
                failures.add(e);
            }
        }
        throw new AllProvidersFailedException("All LLM providers failed", failures);
    }
}
```
Provider order is controlled by @Order on each bean. The primary provider gets @Order(1), the first fallback @Order(2), and so on. Reordering is then a one-line annotation change, isolated from the routing logic itself.
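The ordered-fallback loop can be exercised without Spring at all. A minimal self-contained sketch of the same pattern (the interface and helper names here are simplified stand-ins, not from the post):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Simplified stand-in for the article's provider interface.
interface Provider {
    String name();
    CompletableFuture<String> complete(String prompt);
    boolean isHealthy();
}

class FallbackDemo {

    // Walks providers in list order and returns the first successful result.
    static String complete(List<Provider> providers, String prompt) {
        for (Provider p : providers) {
            if (!p.isHealthy()) continue;
            try {
                return p.complete(prompt).join();
            } catch (Exception e) {
                // Fall through to the next provider in the chain.
            }
        }
        throw new IllegalStateException("All providers failed");
    }

    // Builds a fake provider from a name, a health flag, and a response body.
    static Provider provider(String name, boolean healthy, Supplier<String> body) {
        return new Provider() {
            public String name() { return name; }
            public CompletableFuture<String> complete(String prompt) {
                return CompletableFuture.supplyAsync(body);
            }
            public boolean isHealthy() { return healthy; }
        };
    }
}
```

With a failing primary and a working backup in the list, the loop returns the backup's response, which is the whole contract the gateway promises.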
Per-Provider Timeout Configuration
Hard-coding timeouts inside provider implementations is the wrong pattern. Timeouts belong in configuration so they can be tuned without a deployment.
```yaml
# application.yml
llm:
  openai:
    api-key: ${OPENAI_API_KEY}
    timeout-ms: 10000
    order: 1
  anthropic:
    api-key: ${ANTHROPIC_API_KEY}
    timeout-ms: 15000
    order: 2
  groq:
    api-key: ${GROQ_API_KEY}
    timeout-ms: 8000
    order: 3
```
Read these in each provider's constructor via @Value. One caveat: @Order values are fixed at compile time, so the order keys above only take effect if the gateway sorts the injected provider list by those configured values at startup. Do that, and promoting Groq to primary when it is fast becomes an edit to the order values and a restart, with no code change.
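A minimal sketch of that startup sort, shown over provider names for brevity (the method and map names are assumptions; in the gateway you would sort the bean list by each provider's configured order instead):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class ProviderOrdering {

    // Sorts providers by their configured order value; anything without a
    // configured order sinks to the end of the rotation.
    static List<String> byConfiguredOrder(List<String> providers, Map<String, Integer> order) {
        List<String> sorted = new ArrayList<>(providers);
        sorted.sort(Comparator.comparingInt(p -> order.getOrDefault(p, Integer.MAX_VALUE)));
        return sorted;
    }
}
```

The order map would come from a @ConfigurationProperties binding of the llm block, so reprioritizing providers really is a YAML edit plus a restart.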
Health Checks via Spring Actuator
Expose provider health through Actuator so your monitoring system sees individual provider state:
```java
@Component
public class LlmProvidersHealthIndicator implements HealthIndicator {

    private final List<LlmProvider> providers;

    public LlmProvidersHealthIndicator(List<LlmProvider> providers) {
        this.providers = providers;
    }

    @Override
    public Health health() {
        Map<String, String> providerStatus = new LinkedHashMap<>();
        boolean anyHealthy = false;
        for (LlmProvider provider : providers) {
            boolean healthy = provider.isHealthy();
            providerStatus.put(provider.name(), healthy ? "UP" : "DOWN");
            if (healthy) anyHealthy = true;
        }
        Health.Builder builder = anyHealthy ? Health.up() : Health.down();
        return builder.withDetails(providerStatus).build();
    }
}
```
Hit /actuator/health/llmProviders and your load balancer or Kubernetes readiness probe sees whether at least one provider is available. Wire this to readiness rather than liveness: a provider outage should pull the instance from rotation, not restart it. All providers down returns a DOWN status that triggers your alerting.
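An illustrative response shape from that component endpoint, assuming details are exposed (management.endpoint.health.show-details: always) and the primary happens to be down (the specific statuses here are invented for the example):

```json
{
  "status": "UP",
  "details": {
    "openai": "DOWN",
    "anthropic": "UP",
    "groq": "UP"
  }
}
```

Overall status stays UP while any provider is up, so traffic keeps flowing even mid-outage.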
The Retry Policy That Does Not Hammer Rate-Limited Providers
A naive retry on any exception hammers rate-limited providers and accelerates the failure. Retry the same provider only on transient 5xx errors. Timeouts already trigger fallback to the next provider, and a 429 is a signal to skip ahead immediately, not to retry.
```java
private boolean isRetryable(Exception e) {
    if (e instanceof TimeoutException) return false; // timeouts go straight to the fallback chain
    if (e.getMessage() != null) {
        // Parsing status codes out of messages is fragile; prefer the SDK's
        // typed exceptions where the client library provides them.
        String msg = e.getMessage().toLowerCase();
        // Never retry rate limits or auth failures
        if (msg.contains("429") || msg.contains("rate limit")) return false;
        if (msg.contains("401") || msg.contains("unauthorized")) return false;
        // Retry only transient server errors
        if (msg.contains("500") || msg.contains("503")) return true;
    }
    return false;
}
```
The gateway’s complete method calls the next provider in the fallback chain immediately on a 429; it does not retry the same provider. The loop shown earlier already falls through on any failure, so isRetryable is the hook for layering an optional in-place retry on transient 5xx only. Rate limit responses are signals to route elsewhere, not to wait and retry.
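Wired in, the check gates at most one in-place retry before control falls through to the next provider. A self-contained sketch of that shape (the single-retry budget and helper names are assumptions, not from the post):

```java
import java.util.function.Supplier;

class RetryPolicy {

    // Same classification as the article's isRetryable, over message text.
    static boolean isRetryable(Exception e) {
        String msg = e.getMessage() == null ? "" : e.getMessage().toLowerCase();
        if (msg.contains("429") || msg.contains("rate limit")) return false;  // route to next provider
        if (msg.contains("401") || msg.contains("unauthorized")) return false; // retrying won't fix auth
        return msg.contains("500") || msg.contains("503");                     // transient server errors only
    }

    // Calls the provider once, retrying a single time only for retryable errors.
    static String callWithRetry(Supplier<String> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            if (!isRetryable(e)) throw e; // non-retryable: let the fallback chain take over
            return call.get();            // one retry for a transient 5xx
        }
    }
}
```

A 503 gets one second chance on the same provider; a 429 propagates immediately so the gateway loop can move on.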
What This Means For You
- Control provider order via @Order and configuration, not hard-coded logic. Changing which provider is primary should be a YAML edit and a restart, not a code review.
- Emit metrics on every provider call with provider name and outcome tags. Dashboards on provider error rates and latency tell you which provider is degrading before your users do.
- Treat 429s as routing signals, not retry triggers. A rate-limited provider needs to be skipped immediately. Retrying a rate-limited endpoint with backoff burns time that your fallback could be using.
- Wire the health indicator to your load balancer probe so external traffic stops routing to your service only when all providers are down, not just the primary one.
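In Kubernetes, pointing a readiness probe at the component endpoint keeps the pod in rotation as long as at least one provider is up. An illustrative manifest fragment (the path, port, and timings are assumptions to adapt):

```yaml
readinessProbe:
  httpGet:
    path: /actuator/health/llmProviders
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```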
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
- Pithy Security → Stay ahead of cybersecurity threats.
