Multi-tenant SaaS
Billing Meltdown → Rebuilt
Stripe webhooks were timing out. Duplicate invoices were silently generated. Support was drowning in angry tickets. The billing system had no idempotency, no observability, and no recovery path.
The Problem
- Stripe webhooks firing multiple times due to retries — no idempotency keys
- Invoice jobs ran inline in the web process and timed out under load
- No dead-letter queue (DLQ): failed jobs silently disappeared
- Support had no visibility into billing state beyond “it looks stuck”
- Race conditions between webhook processing and scheduled jobs created duplicates
Architecture Before
Monolithic Laravel app processing Stripe webhooks synchronously within the HTTP request cycle. Scheduled Artisan commands ran billing in batch. No message queue. Retrying a failed job meant a support engineer running manual SQL updates.
The Fix
- Introduced Kafka topic billing.events: Stripe webhooks publish, workers consume
- Idempotency keys on every event using Stripe event ID as partition key
- SQS DLQ for failed consumers with a custom replay UI accessible by support
- Separate worker ECS tasks with auto-scaling based on queue depth
- Structured logging with Datadog integration — every invoice has a trace
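The idempotency and dead-letter bullets above can be sketched roughly as follows. This is a minimal Python illustration, not the production Laravel code: sqlite3 stands in for the billing database, a plain list stands in for the SQS DLQ, and the table names, the handle_event function, and the amount_cents field are all hypothetical.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory stand-in for the billing database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_events (stripe_event_id TEXT PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE invoices ("
        "id INTEGER PRIMARY KEY, "
        "stripe_event_id TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL)"
    )
    return conn


def handle_event(conn: sqlite3.Connection, event: dict, dlq: list) -> str:
    """Process one event at most once.

    Returns 'processed', 'duplicate', or 'dead-lettered'.
    """
    try:
        # One transaction: the dedup claim and the invoice either
        # commit together or roll back together.
        with conn:
            claim = conn.execute(
                "INSERT OR IGNORE INTO processed_events (stripe_event_id) "
                "VALUES (?)",
                (event["id"],),
            )
            if claim.rowcount == 0:
                # The primary key already exists, so this is a retry.
                return "duplicate"
            conn.execute(
                "INSERT INTO invoices (stripe_event_id, amount_cents) "
                "VALUES (?, ?)",
                (event["id"], event["amount_cents"]),
            )
        return "processed"
    except Exception as exc:
        # The failed event is parked for later replay instead of vanishing.
        dlq.append({"event": event, "error": repr(exc)})
        return "dead-lettered"
```

Because the claim and the invoice share a transaction, a failure rolls the claim back as well, so a later replay of the same event is not mistaken for a duplicate. The primary key on the event ID also lets the database, rather than application code, settle a race between two workers.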
Billing tickets dropped by 68% in the first month. The DLQ replay UI reduced mean time to recovery from hours to minutes. Zero duplicate invoices in the 6 months since rollout.
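A rough sketch of what a replay action might do behind such a UI, assuming each dead-lettered entry keeps the original event payload. The replay_dead_letters function, the entry shape, and the handler callback are hypothetical names for illustration only.

```python
from typing import Callable


def replay_dead_letters(dlq: list, handler: Callable[[dict], None]) -> dict:
    """Retry every dead-lettered event once; keep the ones that fail again."""
    still_failing = []
    replayed = 0
    for entry in dlq:
        try:
            handler(entry["event"])
            replayed += 1
        except Exception as exc:
            # Record the latest error so support can see why it is still stuck.
            still_failing.append({"event": entry["event"], "error": repr(exc)})
    # Replace the queue contents in place with whatever still fails.
    dlq[:] = still_failing
    return {"replayed": replayed, "still_failing": len(still_failing)}
```

Replaying like this is only safe when the handler is idempotent. Otherwise a replay of an event that had partly succeeded could create exactly the duplicate invoices the rebuild was meant to remove.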