Billing · Fintech / 01

Multi-tenant SaaS
Billing Meltdown → Rebuilt

Stripe webhooks were timing out. Duplicate invoices were silently generated. Support was drowning in angry tickets. The billing system had no idempotency, no observability, and no recovery path.

-68% Billing support tickets
99.9% Invoice consistency
0 Silent failures post-deploy

The Problem

  • Stripe webhooks firing multiple times due to retries — no idempotency keys
  • Invoice jobs ran directly in the web process, timing out under load
  • No DLQ — failed jobs silently disappeared into the void
  • Support had no visibility into billing state beyond “it looks stuck”
  • Race conditions between webhook processing and scheduled jobs created duplicates

Architecture Before

Monolithic Laravel app processing Stripe webhooks synchronously within the HTTP request cycle. Scheduled Artisan commands ran billing in batch. No message queue. Retry logic was manual SQL updates by support engineers.

    // BEFORE: fire-and-forget, no idempotency
    public function handleWebhook(Request $req)
    {
        $event = Stripe::constructEvent(...);
        if ($event->type === 'invoice.paid') {
            Invoice::create([...]); // duplicate risk!
        }
    }

The Fix

  • Introduced Kafka topic billing.events — Stripe webhooks publish, workers consume
  • Idempotency key on every event: the Stripe event ID, also used as the Kafka partition key
  • SQS DLQ for failed consumers with a custom replay UI accessible by support
  • Separate worker ECS tasks with auto-scaling based on queue depth
  • Structured logging with Datadog integration — every invoice has a trace

    // AFTER: idempotent Kafka consumer
    class BillingEventConsumer
    {
        public function handle(BillingEvent $event): void
        {
            if (Invoice::whereIdempotencyKey($event->stripe_event_id)->exists()) {
                return; // safe!
            }

            DB::transaction(fn() => match($event->type) {
                'invoice.paid' => $this->markPaid($event),
                'invoice.failed' => $this->scheduleRetry($event),
            });
        }
    }
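
A check-then-insert guard can still let two workers through if they both run the check before either writes. One common way to close that window is a unique constraint on the idempotency key, so the database itself rejects the second insert. The sketch below shows the idea in Python, with SQLite standing in for the production database. The table and column names are hypothetical and not taken from the real schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the invoices table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE invoices (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            idempotency_key TEXT NOT NULL UNIQUE,  -- the Stripe event ID
            status TEXT NOT NULL
        )
        """
    )
    return conn


def record_paid(conn: sqlite3.Connection, stripe_event_id: str) -> bool:
    """Insert once per Stripe event; return False for a duplicate delivery."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO invoices (idempotency_key, status) "
            "VALUES (?, 'paid')",
            (stripe_event_id,),
        )
    # rowcount is 1 when the row was written, 0 when the UNIQUE key ignored it
    return cur.rowcount == 1


conn = make_db()
first = record_paid(conn, "evt_123")
second = record_paid(conn, "evt_123")  # webhook retry of the same event
invoice_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

With the constraint in place, the duplicate delivery becomes a no-op at the storage layer, regardless of how many consumers race on the same event.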

Billing tickets dropped 68% in the first month. The DLQ replay UI reduced mean-time-to-recovery from hours to minutes. Zero duplicate invoices in the 6 months since rollout.

Tech Stack

Kafka (Confluent), Laravel 10, AWS ECS Fargate, SQS DLQ, Stripe Webhooks, Redis, Datadog
Infrastructure · Async Systems / 02

Invisible Dead Letters
Made Visible

Invoices stuck in “processing” forever. Support couldn't see why. Engineers were manually querying the database to diagnose failures. “Unknown failure” was the most common ticket category.

-80% Unknown-failure tickets
~4 min Avg. resolution time
100% Failed job visibility

Root Cause

  • Failed queue jobs had no standard error schema — each failure was its own snowflake
  • SQS dead-letter queue existed but was never consumed or surfaced to humans
  • No retry policies — jobs either succeeded once or stayed stuck
  • Support team had zero tooling — engineers were their only recovery path

The DLQ Explorer

Built a React admin panel that surfaces every dead-lettered message with: full payload, failure reason, stack trace, and a one-click replay button. Retry policies are configurable per job type. Support can resolve 90% of failures without engineering involvement.

    // Standardized job failure envelope
    {
      "job_id": "inv_9f2c...",
      "job_type": "ProcessInvoicePayment",
      "failed_at": "2024-11-12T14:32:01Z",
      "attempt": 3,
      "max_attempts": 5,
      "reason": "Stripe rate limit exceeded",
      "payload": { "invoice_id": 4821, ... },
      "next_retry_at": "2024-11-12T14:37:01Z",
      "backoff_strategy": "exponential"
    }

Automated Retry Policies

  • Exponential backoff: 1min → 5min → 25min → alert
  • Job-type-aware retry limits (payment jobs: 3x, notification jobs: 10x)
  • Slack alert on final DLQ entry with a direct link to the explorer
  • Weekly DLQ health report emailed to product leads
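
The retry schedule above can be sketched as a small pure function. This is a minimal illustration in Python, not the production scheduler. The constant names and job-kind keys are hypothetical, and the 25-minute cap for job types with more than three retries is an assumption, since the schedule only spells out the first three delays.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BASE_DELAY = timedelta(minutes=1)
BACKOFF_FACTOR = 5                     # 1 min -> 5 min -> 25 min
MAX_DELAY = timedelta(minutes=25)      # assumption: cap for longer retry chains
RETRY_LIMITS = {"payment": 3, "notification": 10}  # hypothetical job-kind keys


def next_retry_at(job_kind: str, retries_so_far: int,
                  failed_at: datetime) -> Optional[datetime]:
    """Return the next retry time, or None once the retry limit is used up.

    None means the job should be dead-lettered and an alert sent.
    """
    if retries_so_far >= RETRY_LIMITS[job_kind]:
        return None
    delay = min(BASE_DELAY * BACKOFF_FACTOR ** retries_so_far, MAX_DELAY)
    return failed_at + delay


failed_at = datetime(2024, 11, 12, 14, 32, 1, tzinfo=timezone.utc)
# Delays for a payment job: +1 min, +5 min, +25 min, then None (alert)
schedule = [next_retry_at("payment", n, failed_at) for n in range(4)]
```

Keeping the policy in one function makes the per-job-type limits easy to change without touching the workers.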

Support team now resolves 90% of async failures independently. MTTR dropped from >2 hours to 4 minutes. Engineers got their mornings back.

Tech Stack

SQS / SNS, Laravel Jobs, React Admin, MySQL, Slack API, AWS CloudWatch
Compliance · Risk / 03

CORPRISK: Immutable
Audit Trail from Scratch

Auditors demanded a full immutable history of every data mutation. The existing system had none. Zoho CRM sync was flaky. Reconciliation was a 2-engineer, 3-day manual process per audit.

-90% Manual reconciliation
100% Event replay fidelity
4 lines Idempotency fix size

Constraints

  • Auditors needed retroactive history — existing data had no provenance
  • Zoho CRM integration was push-only — conflicts happened silently
  • Compliance required records be immutable — UPDATE and DELETE were off the table
  • System had to stay live, and no change could bypass the audit trail after cutover

Event-Sourcing Design

Introduced an entity_events append-only table. Every domain action (create, update, status change) emits an event. Current state is derived by replaying events. A projection layer builds queryable read models for the UI and API.

    -- Append-only event store
    CREATE TABLE entity_events (
      id BIGINT PRIMARY KEY AUTO_INCREMENT,
      entity_id VARCHAR(36) NOT NULL,
      entity_type VARCHAR(64) NOT NULL,
      event_type VARCHAR(64) NOT NULL,
      payload JSON NOT NULL,
      actor_id BIGINT NOT NULL,
      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
      -- No UPDATE, no DELETE. Ever.
      INDEX (entity_id, entity_type)
    );
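
Deriving current state by replay amounts to folding the event list into a single record. The Python sketch below assumes three hypothetical event types ("created", "updated", "status_changed") and a payload that carries only the changed fields. The real event vocabulary and payload shape are not specified here.

```python
from typing import Any, Dict, Iterable, Optional

Event = Dict[str, Any]
State = Optional[Dict[str, Any]]


def apply_event(state: State, event: Event) -> State:
    """Fold one event into the current state without mutating the input."""
    kind = event["event_type"]
    if kind == "created":
        return dict(event["payload"])
    if state is None:
        raise ValueError(f"{kind!r} event arrived before 'created'")
    if kind in ("updated", "status_changed"):
        return {**state, **event["payload"]}
    raise ValueError(f"unknown event_type: {kind!r}")


def replay(events: Iterable[Event]) -> State:
    """Rebuild an entity's state by applying its events in id order."""
    state: State = None
    for event in sorted(events, key=lambda e: e["id"]):
        state = apply_event(state, event)
    return state


events = [
    {"id": 1, "event_type": "created", "payload": {"name": "Acme", "status": "open"}},
    {"id": 2, "event_type": "updated", "payload": {"name": "Acme Ltd"}},
    {"id": 3, "event_type": "status_changed", "payload": {"status": "closed"}},
]
current = replay(events)
as_of_first_event = replay(events[:1])  # state at an earlier point in history
```

Replaying a prefix of the log gives the state at an earlier point in time, which is the property an audit relies on.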

Conflict Resolution

  • Zoho CRM changes arrive as events with source: "zoho"
  • Conflict detection via last_seen_event_id optimistic locking
  • Conflicting events queued to a review UI, not silently overwritten
  • Auditors get a full diff view for every conflict with actor identity
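
The optimistic-locking check can be sketched as follows: a writer states which event it last saw for the entity, and the append only succeeds if that is still the latest one. This Python version uses an in-memory list in place of the entity_events table. The class and method names are hypothetical, and a real implementation would need the check and the insert to run atomically, for example inside one transaction.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EventLog:
    """In-memory stand-in for the append-only entity_events table."""
    events: List[Dict[str, Any]] = field(default_factory=list)
    review_queue: List[Dict[str, Any]] = field(default_factory=list)

    def latest_id(self, entity_id: str) -> Optional[int]:
        ids = [e["id"] for e in self.events if e["entity_id"] == entity_id]
        return ids[-1] if ids else None

    def append(self, entity_id: str, payload: Dict[str, Any], source: str,
               last_seen_event_id: Optional[int]) -> bool:
        event = {"entity_id": entity_id, "payload": payload, "source": source}
        if last_seen_event_id != self.latest_id(entity_id):
            # The writer acted on stale state: park the event for human review
            # instead of letting it overwrite the newer change.
            self.review_queue.append(
                {**event, "last_seen_event_id": last_seen_event_id}
            )
            return False
        event["id"] = len(self.events) + 1
        self.events.append(event)
        return True


log = EventLog()
created = log.append("cust-1", {"phone": "111"}, "app", None)
app_edit = log.append("cust-1", {"phone": "222"}, "app", 1)
zoho_edit = log.append("cust-1", {"phone": "333"}, "zoho", 1)  # stale: latest is 2
```

The rejected Zoho event is kept in full, so a reviewer can see both sides of the conflict and who made each change.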

Manual reconciliation dropped 90%. The 2-engineer, 3-day audit process became a 4-hour automated report. Zero compliance findings in two external audits since launch.

Tech Stack

Laravel, React, MySQL (append-only), Zoho CRM API, Event Sourcing
Performance · Database / 04

Report Dashboard:
8s → 0.8s p95 Latency

The analytics dashboard was so slow users stopped using it. Reports took 8 seconds on a good day, timed out on a bad one. No indexes, full table scans, no caching layer.

0.8s p95 latency (was 3.2s)
-75% Query execution time
0 Timeouts post-deploy

Diagnosis

  • EXPLAIN ANALYZE revealed full table scans on a 14M-row orders table
  • N+1 queries in the report aggregation layer — hundreds of round trips per request
  • No read replicas — heavy analytical queries contending with OLTP writes
  • Dashboard hit the database fresh on every page load, no caching at any layer

    -- BEFORE: full scan, no composite index
    SELECT SUM(amount), DATE(created_at)
    FROM orders
    WHERE tenant_id = 42 AND status = 'completed'
    GROUP BY DATE(created_at);
    -- Execution: 6,800ms. Rows examined: 14M

    -- AFTER: composite index + date function fix
    CREATE INDEX idx_orders_tenant_status_date
      ON orders (tenant_id, status, created_at);
    -- Execution: 42ms. Rows examined: 8,200

The Three-Layer Fix

  • Layer 1 — Indexes: 8 composite indexes targeting the report query patterns. Immediate 85% query time reduction.
  • Layer 2 — Read Replica: Routed all SELECT queries from the dashboard to a read replica, eliminating write contention.
  • Layer 3 — Redis Cache: 5-minute TTL cache on pre-computed aggregates. Invalidated on write events. Dashboard loads from cache 94% of the time.

Caching Strategy

Used a cache-aside pattern with tenant-scoped cache keys. A background job pre-warms the most-accessed report combinations every 4 minutes, just inside the 5-minute TTL. Cache hit rate sits at 94%. A cold cache still serves in <800ms thanks to the index optimizations.
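
Cache-aside with tenant-scoped keys looks roughly like the Python sketch below. A small in-memory class stands in for Redis so the example runs on its own. The key format and function names are hypothetical, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

REPORT_TTL_SECONDS = 300  # the 5-minute TTL


class TTLCache:
    """Tiny in-memory stand-in for Redis GET / SETEX / DEL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[float, Any]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        hit = self._data.get(key)
        if hit is None:
            return None
        expires_at, value = hit
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def report_key(tenant_id: int, report: str) -> str:
    """Scope every key to one tenant so tenants never share cached data."""
    return f"tenant:{tenant_id}:report:{report}"


def get_report(cache: TTLCache, tenant_id: int, report: str,
               compute: Callable[[], Any]) -> Any:
    """Cache-aside read: try the cache, fall back to the query, then store."""
    key = report_key(tenant_id, report)
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = compute()  # the expensive aggregate query
    cache.set(key, value, REPORT_TTL_SECONDS)
    return value
```

Invalidation on a write event is then a `cache.delete(report_key(tenant_id, report))` call for the affected tenant, and a pre-warm job can call `get_report` on a schedule.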

p95 dropped from 3.2s to 0.8s. Zero timeouts since launch. Dashboard usage increased 340% in the following month because users actually started trusting the data.

Tech Stack

MySQL 8.0, AWS RDS Read Replica, Redis 7, Laravel Cache, Datadog APM
Platform · Traffic / 05

Edge Layer Hardening:
Viral Traffic Without Meltdown

A partner integration went viral on social media, and legitimate traffic spiked 22× in ten minutes. Naive rate limits blocked real users; APIs returned 503s; Redis ran hot. The first fix wasn't “more servers”; it was coherent degradation at the edge.

0 Hard outages during spike
-94% 503 error rate vs baseline panic
<120ms p95 at gateway (cached)

The Problem

  • Global rate limit was per-IP — mobile carriers NAT'd thousands of users behind one address
  • Circuit breakers were binary, either fully open or fully closed, with no graceful partial responses
  • Origin PHP workers saturated before autoscale caught up; queue depth became the real bottleneck
  • No request classification — marketing burst traffic hit the same pools as checkout webhooks

Architecture Shift

Introduced an AWS API Gateway + WAF front door with JWT-aware routes, tenant-scoped quotas, and separate usage plans for “burst” vs “critical path.” Heavy GET endpoints moved behind CloudFront with stale-while-revalidate; POST mutations stayed origin-only with token buckets per API key.
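
A per-key token bucket can be sketched in a few lines. The Python version below is only an illustration of the technique, not the gateway configuration itself. The capacity and refill values are placeholders, and time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict


class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float,
                 now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Top up for the time elapsed since the last call, capped at capacity
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets: Dict[str, TokenBucket] = {}


def allow_mutation(api_key: str, now: float,
                   capacity: float = 5, refill_per_second: float = 1.0) -> bool:
    """One bucket per API key, created on first use (placeholder limits)."""
    bucket = buckets.setdefault(
        api_key, TokenBucket(capacity, refill_per_second, now)
    )
    return bucket.allow(now)


# Six POSTs from one key in the same instant: the first five pass, the sixth waits
burst = [allow_mutation("key_a", 0.0) for _ in range(6)]
```

Because each API key has its own bucket, one noisy client drains only its own allowance.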

    // Adaptive breaker + fallback payload (Node middleware sketch)
    const breaker = new CircuitBreaker(callOrigin, {
      timeout: 2500,
      errorThresholdPercentage: 45,
      resetTimeout: 8000,
      volumeThreshold: 30,
    });

    breaker.fallback(() => ({
      status: 200,
      body: { degraded: true, etag: cachedEtag, data: lastGoodSnapshot },
    }));

What Moved the Needle

  • Sliding-window limits per tenant_id + API key — not just IP
  • Priority lanes: webhook ingestion always reserved min concurrency
  • Breaker opened to JSON fallbacks for read-heavy dashboards — stale data beat hard errors
  • Dashboards in Datadog tied gateway 5xx to autoscale policies — alarms fired before users noticed
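
The sliding-window limit keyed on tenant and API key, rather than on IP, can be sketched like this. It is a minimal in-memory Python illustration. The class name and limits are hypothetical, and a shared store such as Redis would be needed once more than one gateway node is involved.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, tenant_id: str, api_key: str, now: float) -> bool:
        hits = self._hits[(tenant_id, api_key)]
        # Drop timestamps that have slid out of the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
# Two tenants behind the same carrier NAT address are counted separately
tenant_a = [limiter.allow("tenant_a", "key_1", t) for t in (0.0, 1.0, 2.0, 3.0)]
tenant_b = limiter.allow("tenant_b", "key_9", 3.0)
```

Keying on the tenant and API key means users who share one public IP no longer throttle each other.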

The spike became a non-event for core flows: zero hard outages, error budget intact, and product kept shipping instead of firefighting.

Tech Stack

AWS API Gateway, CloudFront, WAF, Node.js, Redis, Datadog
Data · Real-time / 06

CDC Field Notes:
Cache Invalidation That Matches Reality

Redis caches were stamped with TTL guesswork — users saw stale inventory for minutes after warehouse updates. Nightly ETL was too slow; polling MySQL from app servers crushed replication lag. We needed change streams, not cron jobs.

<2s Median visibility after DB commit
-87% Stale-read support tickets
1 Source of truth (still MySQL)

The Problem

  • Cache keys were coarse (“tenant catalog”) — any SKU touch invalidated huge graphs
  • Binlog-based hacks existed but weren't wired to domain events — ops feared silent drift
  • Kafka was already in play for billing — duplicate pipelines weren't acceptable

Pipeline Design

Debezium connector on MySQL binlog → Kafka topic inventory.cdc → thin consumers that translate row-level commits into granular Redis DEL / SET operations and optional websocket fan-out for live dashboards.

    -- CDC topic payload (simplified)
    {
      "before": null,
      "after": { "sku_id": "SKU-991", "qty": 42, "warehouse": "LHR-01" },
      "op": "u",
      "ts_ms": 1735689600123
    }
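
A thin consumer's core job is to turn one change event into the smallest set of cache operations. The Python sketch below maps the simplified payload shape onto SET and DEL commands, using the Debezium operation codes "c" (create), "u" (update), "r" (snapshot read) and "d" (delete). The cache key format is hypothetical, and the function only returns the operations instead of calling Redis, so it can run on its own.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
CacheOp = Tuple[str, str, Optional[Row]]  # (command, key, value)


def cache_key(row: Row) -> str:
    """One key per SKU per warehouse (hypothetical key format)."""
    return f"inventory:{row['warehouse']}:{row['sku_id']}"


def to_cache_ops(change: Dict[str, Any]) -> List[CacheOp]:
    """Translate one row-level change into granular cache operations."""
    op = change["op"]
    before = change.get("before")
    after = change.get("after")
    if op in ("c", "u", "r"):
        ops: List[CacheOp] = []
        # If the row moved to a different key, remove the old entry first
        if before is not None and cache_key(before) != cache_key(after):
            ops.append(("DEL", cache_key(before), None))
        ops.append(("SET", cache_key(after), after))
        return ops
    if op == "d":
        return [("DEL", cache_key(before), None)]
    raise ValueError(f"unsupported op: {op!r}")


update_event = {
    "before": None,
    "after": {"sku_id": "SKU-991", "qty": 42, "warehouse": "LHR-01"},
    "op": "u",
    "ts_ms": 1735689600123,
}
ops = to_cache_ops(update_event)
```

Only the touched SKU's key changes, which is what removes the coarse "tenant catalog" invalidations.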

Safety Rails

  • Effectively-once processing at the consumer via an idempotency key (binlog position + primary key)
  • Poison messages sidelined to DLQ with replay UI (same patterns as Case 02)
  • Feature flag to fall back to TTL-only mode if connector lag exceeded SLO
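
The first and last rails can be sketched together. The Python below builds the idempotency key from the binlog position plus the primary key and skips anything already applied, then picks the cache mode from connector lag. The names are hypothetical, the in-memory set stands in for a durable store, and the lag threshold is a parameter because the actual SLO value is not stated.

```python
from typing import Callable, Set


class IdempotentApplier:
    """Apply each change at most once, keyed by binlog position + primary key."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()  # stand-in for a durable dedupe store

    def handle(self, binlog_file: str, binlog_pos: int, primary_key: str,
               apply: Callable[[], None]) -> bool:
        key = f"{binlog_file}:{binlog_pos}:{primary_key}"
        if key in self._seen:
            return False  # redelivery: already applied, skip it
        apply()
        self._seen.add(key)  # mark as done only after apply() succeeded
        return True


def cache_mode(connector_lag_ms: int, lag_slo_ms: int) -> str:
    """Fall back to TTL-only caching while the connector is behind its SLO."""
    return "ttl_only" if connector_lag_ms > lag_slo_ms else "cdc"


applied = []
applier = IdempotentApplier()
first = applier.handle("mysql-bin.000003", 1542, "SKU-991",
                       lambda: applied.append("SKU-991"))
redelivered = applier.handle("mysql-bin.000003", 1542, "SKU-991",
                             lambda: applied.append("SKU-991"))
```

Marking the key only after the write succeeds means a crash mid-apply leads to a retry and never to a silently skipped change.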

Outcome

Support stopped arguing about “ghost stock.” Merchandising trusted the dashboard again because the cache finally tracked writes instead of hoping timeouts aligned with human patience.

Stale-read tickets dropped 87%; median propagation sat under 2 seconds with zero dual-write corruption.

Tech Stack

Debezium, MySQL binlog, Kafka, Redis, Laravel (consumers)