When Stripe processes a payment, it needs to notify your application, update its own internal systems, trigger fraud analysis, and record the event for audit purposes. Calling each destination synchronously chains all those systems together: if any one is slow or down, the entire payment flow stalls. Kafka decouples the event producer (Stripe’s payment processor) from every consumer, letting each system process events at its own rate without blocking anything else.
Analysis Briefing
- Topic: Event-driven architecture, Kafka, and message queue design patterns
- Analyst: Mike D (@MrComputerScience)
- Context: Born from an exchange with Claude Sonnet 4.6 that refused to stay shallow
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: What does a message queue give you that a synchronous API call fundamentally cannot?
The Problem With Synchronous API Calls at Scale
A synchronous HTTP call couples the caller to the callee. The caller waits for the callee to respond before proceeding. This creates three failure modes that are manageable with two services and catastrophic with twenty.
Cascading failures. If Service B is slow, Service A blocks waiting. If many requests pile up waiting for B, A exhausts its connection pool or thread pool and starts rejecting requests. The failure propagates upstream.
Temporal coupling. Both services must be available simultaneously. A deployment of Service B that introduces a 30-second restart creates errors in Service A during that window.
Throughput mismatch. If Service A produces events at 50,000 per second and Service B can process 10,000 per second, a synchronous architecture either throttles A or drops events. Neither is acceptable.
A message queue absorbs the mismatch. A produces at its rate, the queue buffers, B consumes at its rate. The two rates decouple entirely.
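The rate decoupling can be sketched with the standard library alone. This is a toy model, not any real Stripe service: a producer emits 5 events per tick, a consumer drains 2 per tick, and the queue absorbs the difference instead of forcing the producer to block or drop.

```python
from collections import deque

def simulate(ticks: int, produce_rate: int = 5, consume_rate: int = 2):
    """Toy producer/queue/consumer: each side runs at its own rate."""
    queue = deque()
    produced = consumed = 0
    for _ in range(ticks):
        for _ in range(produce_rate):  # producer never waits on the consumer
            queue.append(f"event-{produced}")
            produced += 1
        for _ in range(min(consume_rate, len(queue))):  # consumer drains at its pace
            queue.popleft()
            consumed += 1
    return produced, consumed, len(queue)

produced, consumed, backlog = simulate(ticks=10)
print(produced, consumed, backlog)  # 50 20 30: backlog grows, nothing is dropped
```

The backlog growing unboundedly is exactly why consumer lag matters as a metric: buffering converts a hard failure (dropped events) into a soft one (growing lag) that you must still watch.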
How Kafka Solves the Problem Differently From Traditional Queues
Traditional message queues (RabbitMQ, SQS) are designed around consumption: a message is delivered to a consumer and deleted. Kafka is designed around a persistent, ordered log. Messages are written to topics and retained for a configurable period (hours to indefinitely). Consumer groups track their own position (offset) in the log independently.
This has three consequences that matter at Stripe’s scale.
First, replay. If your fraud detection service has a bug and processes events incorrectly for 6 hours, you can deploy the fix and replay the last 6 hours of the payment-events topic against the corrected service. Traditional queues cannot do this: messages are deleted after first delivery.
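Replay falls out of the log model directly. Here is a pure-Python sketch of the idea (not the Kafka API): the log retains every record, and a consumer's position is nothing more than an integer it controls and can rewind.

```python
class Log:
    """Append-only log: records are retained, never deleted on read."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class LogConsumer:
    """A consumer is just an offset into the log, owned by the consumer."""
    def __init__(self, log: Log):
        self.log = log
        self.offset = 0

    def poll(self):
        if self.offset < len(self.log.records):
            record = self.log.records[self.offset]
            self.offset += 1
            return record
        return None

    def seek(self, offset: int):
        self.offset = offset  # rewind: recovering from a 6-hour bug is one seek

log = Log()
for i in range(5):
    log.append({"payment": i})

consumer = LogConsumer(log)
first_pass = [consumer.poll() for _ in range(5)]
consumer.seek(0)                  # deploy the fix, then replay from the start
second_pass = [consumer.poll() for _ in range(5)]
assert first_pass == second_pass  # same events, reprocessed in the same order
```

In real Kafka the equivalent rewind is `Consumer.seek()`, with `offsets_for_times()` available to map a wall-clock timestamp ("6 hours ago") to an offset.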
Second, fan-out at no extra cost. Adding a new consumer of payment events requires zero changes to the producer. The new service subscribes to the topic and reads from offset 0 (or wherever it needs to start). Multiple independent consumers read the same log simultaneously.
Third, ordering guarantees within a partition. Kafka guarantees message ordering within a partition. If you partition payment events by customer ID, all events for a given customer arrive at the same consumer in order, enabling stateful processing without distributed coordination.
```python
import json

from confluent_kafka import Producer, Consumer

# Producer: Stripe's payment service
producer = Producer({'bootstrap.servers': 'kafka:9092'})

def on_payment_completed(payment: dict):
    producer.produce(
        topic='payment-events',
        key=payment['customer_id'].encode(),  # partition by customer
        value=json.dumps(payment).encode(),
        callback=lambda err, msg: log_delivery(err, msg)
    )
    producer.poll(0)  # serve delivery callbacks without blocking

# Consumer: your webhook handler
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'webhook-delivery',
    'enable.auto.commit': False,  # commit manually, not on a background timer
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['payment-events'])

while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        payment = json.loads(msg.value())
        deliver_webhook(payment)
        consumer.commit(message=msg)  # explicit commit after successful delivery
```
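The `key=customer_id` choice in the producer above is what buys per-customer ordering. A sketch of the routing rule: Kafka's default partitioner hashes the key with murmur2, but any stable hash (the byte-sum below is purely illustrative) demonstrates the property that equal keys always land on the same partition.

```python
NUM_PARTITIONS = 4

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's keyed partitioner: a stable hash of the key,
    # so the same customer always routes to the same partition.
    return sum(key) % num_partitions

events = [
    {"customer_id": "cus_A", "seq": 1},
    {"customer_id": "cus_B", "seq": 1},
    {"customer_id": "cus_A", "seq": 2},
    {"customer_id": "cus_A", "seq": 3},
]

partitions = {i: [] for i in range(NUM_PARTITIONS)}
for event in events:
    p = partition_for(event["customer_id"].encode())
    partitions[p].append(event)

# All of cus_A's events landed on one partition, in production order,
# so a single consumer sees them in sequence with no coordination.
cus_a_partition = partition_for(b"cus_A")
cus_a_seqs = [e["seq"] for e in partitions[cus_a_partition]
              if e["customer_id"] == "cus_A"]
assert cus_a_seqs == [1, 2, 3]
```

The flip side is the last bullet below: a key with low cardinality hashes to few distinct partitions, capping your parallelism no matter how many partitions exist.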
When Not to Use Kafka
Kafka is operationally complex. It requires ZooKeeper or KRaft for cluster coordination, careful partition count planning (repartitioning later is painful), and monitoring of consumer lag to detect when consumers fall behind producers.
For applications processing under 10,000 events per second, needing fewer than 5 distinct consumer groups, and not requiring replay capability, a managed queue like SQS or a simple Redis Streams setup is a better fit. The operational overhead of Kafka is only justified by scale or the replay requirement.
Synchronous APIs remain the right choice for requests that need immediate responses. User-facing queries, authentication checks, and any operation where the caller needs the result before proceeding should stay synchronous. The rule is: use a message queue when the producer does not need to know whether the consumer succeeded, and use a synchronous API when it does.
What This Means For You
- Identify which operations in your system truly need synchronous responses before reaching for a message queue, because async complexity is only justified when the alternative (synchronous coupling) creates real availability or throughput problems.
- Design consumers to be idempotent before deploying any Kafka consumer to production, because at-least-once delivery means your consumer will occasionally process the same message twice and must handle that correctly.
- Start with consumer lag as your primary Kafka health metric, because a consumer that cannot keep up with the producer will eventually fall outside the topic's retention window and silently lose messages, a failure mode harder to detect than the synchronous errors of the architecture you replaced.
- Use partition keys that distribute load evenly, because a poorly chosen partition key (like a status field with 3 values) sends all traffic to 3 partitions and limits parallelism regardless of how many partitions you create.
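The idempotency point above can be made concrete: record a delivery key per event so a redelivered message becomes a no-op. A minimal in-memory sketch, where `event_id` is an assumed field on the event and the side effect is stubbed; a production system would use a database unique constraint or Redis `SETNX` instead of a process-local set.

```python
processed_ids = set()  # in production: a DB unique index or Redis SETNX

def handle_payment(event: dict, deliveries: list) -> bool:
    """Apply the event's side effect at most once under at-least-once delivery."""
    if event["event_id"] in processed_ids:
        return False              # duplicate delivery: safely ignored
    deliveries.append(event)      # the real side effect (webhook, DB write, ...)
    processed_ids.add(event["event_id"])
    return True

deliveries = []
event = {"event_id": "evt_123", "amount": 4200}
assert handle_payment(event, deliveries) is True
assert handle_payment(event, deliveries) is False  # redelivery is a no-op
assert len(deliveries) == 1
```

Note the ordering trap even here: if the process crashes between the side effect and recording the id, the event is redelivered and reprocessed, which is why the dedupe check and the side effect ideally commit in one transaction.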
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
