Case Studies — Six Technical Deep Dives

Billing · Fintech / 01

Multi-tenant SaaS
Billing Meltdown → Rebuilt

Stripe webhooks were timing out. Duplicate invoices were silently generated. Support was drowning in angry tickets. The billing system had no idempotency, no observability, and no recovery path.

-68% Billing support tickets

99.9% Invoice consistency

0 Silent failures post-deploy

The Problem

Stripe webhooks firing multiple times due to retries — no idempotency keys
Invoice jobs ran inline in the web process, timing out under load
No DLQ — failed jobs silently disappeared into the void
Support had no visibility into billing state beyond “it looks stuck”
Race conditions between webhook processing and scheduled jobs created duplicates

Architecture Before

Monolithic Laravel app processing Stripe webhooks synchronously within the HTTP request cycle. Scheduled Artisan commands ran billing in batch. No message queue. Retry logic was manual SQL updates by support engineers.

// BEFORE: fire-and-forget, no idempotency
public function handleWebhook(Request $req) {
  $event = Stripe::constructEvent(...);
  if ($event->type === 'invoice.paid') {
    Invoice::create([...]);  // duplicate risk!
  }
}
        

The Fix

Introduced Kafka topic billing.events — Stripe webhooks publish, workers consume
Idempotency key on every event: the Stripe event ID, also used as the Kafka partition key
SQS DLQ for failed consumers with a custom replay UI accessible by support
Separate worker ECS tasks with auto-scaling based on queue depth
Structured logging with Datadog integration — every invoice has a trace

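// AFTER: idempotent Kafka consumer
class BillingEventConsumer {
  public function handle(BillingEvent $event): void {
    // Stripe retries reuse the event ID, so a known ID means "already applied"
    if (Invoice::whereIdempotencyKey(
      $event->stripe_event_id
    )->exists()) return;  // safe!

    // A unique index on the idempotency key is still needed to close the
    // race between the check above and the write below.
    DB::transaction(fn() => match($event->type) {
      'invoice.paid' => $this->markPaid($event),
      'invoice.failed' => $this->scheduleRetry($event),  // Stripe itself names this invoice.payment_failed
      default => null,  // without this arm, any other type throws UnhandledMatchError
    });
  }
}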
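The existence check in the consumer above narrows the duplicate window, but two workers handling the same redelivered event can both pass it. A minimal Python sketch of letting a unique constraint make the final call follows. The `invoices` table layout and the `make_db` and `apply_once` helpers are hypothetical, and in-memory SQLite stands in for the production database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for the real database (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        " id INTEGER PRIMARY KEY,"
        " idempotency_key TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL)"
    )
    return conn


def apply_once(conn: sqlite3.Connection, stripe_event_id: str, status: str) -> bool:
    """Record the invoice unless this event was already applied.

    Returns True when the event was applied and False for a duplicate.
    The UNIQUE constraint, not a prior SELECT, rejects the duplicate.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO invoices (idempotency_key, status) VALUES (?, ?)",
                (stripe_event_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling `apply_once` twice with the same event ID leaves exactly one row, which is the property the consumer relies on regardless of how many times the event is redelivered.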
        

Billing tickets dropped 68% in the first month. The DLQ replay UI reduced mean-time-to-recovery from hours to minutes. Zero duplicate invoices in the 6 months since rollout.

Tech Stack

Kafka (Confluent) · Laravel 10 · AWS ECS Fargate · SQS DLQ · Stripe Webhooks · Redis · Datadog

Infrastructure · Async Systems / 02

Invisible Dead Letters
Made Visible

Invoices stuck in “processing” forever. Support couldn't see why. Engineers were manually querying the database to diagnose failures. “Unknown failure” was the most common ticket category.

-80% Unknown-failure tickets

~4min Avg. resolution time

100% Failed job visibility

Root Cause

Failed queue jobs had no standard error schema — each failure was its own snowflake
SQS dead-letter queue existed but was never consumed or surfaced to humans
No retry policies — jobs either succeeded once or stayed stuck
Support team had zero tooling — engineers were their only recovery path

The DLQ Explorer

Built a React admin panel that surfaces every dead-lettered message with: full payload, failure reason, stack trace, and a one-click replay button. Retry policies are configurable per job type. Support can resolve 90% of failures without engineering involvement.

// Standardized job failure envelope
{
  "job_id": "inv_9f2c...",
  "job_type": "ProcessInvoicePayment",
  "failed_at": "2024-11-12T14:32:01Z",
  "attempt": 3,
  "max_attempts": 5,
  "reason": "Stripe rate limit exceeded",
  "payload": { "invoice_id": 4821, ... },
  "next_retry_at": "2024-11-12T14:37:01Z",
  "backoff_strategy": "exponential"
}
        

Automated Retry Policies

Exponential backoff: 1min → 5min → 25min → alert
Job-type-aware retry limits (payment jobs: 3x, notification jobs: 10x)
Slack alert on final DLQ entry with a direct link to the explorer
Weekly DLQ health report emailed to product leads
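The backoff schedule above (1min, 5min, 25min, then alert) is a geometric series with factor 5. A small Python sketch of that calculation follows. The function name, the `RETRY_LIMITS` table, and the reading of "3x" as three retries are assumptions made for illustration.

```python
from typing import Optional

BASE_DELAY_MINUTES = 1
BACKOFF_FACTOR = 5

# Hypothetical per-job-type retry limits, mirroring the list above.
RETRY_LIMITS = {"payment": 3, "notification": 10}


def next_delay_minutes(failed_attempts: int, max_retries: int) -> Optional[int]:
    """Minutes to wait before the next retry.

    Returns None once retries are exhausted, which is the point where the
    job lands in the DLQ and an alert should fire.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > max_retries:
        return None
    return BASE_DELAY_MINUTES * BACKOFF_FACTOR ** (failed_attempts - 1)
```

With a limit of three retries, the first three failures produce waits of 1, 5, and 25 minutes, and the fourth failure returns `None`.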

Support team now resolves 90% of async failures independently. MTTR dropped from >2 hours to 4 minutes. Engineers got their mornings back.

Tech Stack

SQS / SNS · Laravel Jobs · React Admin · MySQL · Slack API · AWS CloudWatch

Compliance · Risk / 03

CORPRISK: Immutable
Audit Trail from Scratch

Auditors demanded a full immutable history of every data mutation. The existing system had none. Zoho CRM sync was flaky. Reconciliation was a 2-engineer, 3-day manual process per audit.

-90% Manual reconciliation

100% Event replay fidelity

4 lines Idempotency fix size

Constraints

Auditors needed retroactive history — existing data had no provenance
Zoho CRM integration was push-only — conflicts happened silently
Compliance required records be immutable — UPDATE and DELETE were off the table
The system had to stay live through cutover, with no unaudited changes afterwards

Event-Sourcing Design

Introduced an entity_events append-only table. Every domain action (create, update, status change) emits an event. Current state is derived by replaying events. A projection layer builds queryable read models for the UI and API.

-- Append-only event store
CREATE TABLE entity_events (
  id          BIGINT PRIMARY KEY AUTO_INCREMENT,
  entity_id   VARCHAR(36) NOT NULL,
  entity_type VARCHAR(64) NOT NULL,
  event_type  VARCHAR(64) NOT NULL,
  payload     JSON NOT NULL,
  actor_id    BIGINT NOT NULL,
  created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  -- No UPDATE, no DELETE. Ever.
  INDEX (entity_id, entity_type)
);
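To make "current state is derived by replaying events" concrete, here is a minimal Python sketch of a fold over one entity's events. The event type names (`created`, `updated`, `status_changed`) and the merge-by-key rule are assumptions; the real projection layer may apply richer logic per event type.

```python
from typing import Any, Dict, Iterable, Optional

Event = Dict[str, Any]


def replay(events: Iterable[Event]) -> Optional[Dict[str, Any]]:
    """Fold one entity's events, in id order, into its current state."""
    state: Optional[Dict[str, Any]] = None
    for event in sorted(events, key=lambda e: e["id"]):
        kind = event["event_type"]
        if kind == "created":
            state = dict(event["payload"])
        elif kind in ("updated", "status_changed"):
            if state is None:
                raise ValueError("update event arrived before creation")
            state.update(event["payload"])  # later events win per field
    return state
```

Because the table is never updated in place, running the same fold up to any earlier event ID reproduces the state as of that point, which is what an audit replay needs.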
        

Conflict Resolution

Zoho CRM changes arrive as events with source: "zoho"
Conflict detection via last_seen_event_id optimistic locking
Conflicting events queued to a review UI, not silently overwritten
Auditors get a full diff view for every conflict with actor identity
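The conflict rules above can be sketched as an optimistic check on append: a writer states the last event it saw, and a mismatch routes the change to review instead of overwriting. This Python sketch uses an in-memory list in place of the event table; the `EntityStream` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityStream:
    events: List[dict] = field(default_factory=list)
    review_queue: List[dict] = field(default_factory=list)

    @property
    def head_event_id(self) -> int:
        # ID of the newest event, or 0 for an empty stream.
        return self.events[-1]["id"] if self.events else 0

    def append(self, payload: dict, source: str, last_seen_event_id: int) -> bool:
        """Append when the writer saw the latest event; otherwise queue for review."""
        candidate = {
            "payload": payload,
            "source": source,
            "last_seen_event_id": last_seen_event_id,
        }
        if last_seen_event_id != self.head_event_id:
            self.review_queue.append(candidate)  # stale writer: never overwrite silently
            return False
        candidate["id"] = self.head_event_id + 1
        self.events.append(candidate)
        return True
```

A Zoho change based on a stale view returns `False` and sits in `review_queue` with its source attached, so a reviewer can see both sides of the conflict.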

Manual reconciliation dropped 90%. The 2-engineer, 3-day audit process became a 4-hour automated report. Zero compliance findings in two external audits since launch.

Tech Stack

Laravel · React · MySQL (append-only) · Zoho CRM API · Event Sourcing

Performance · Database / 04

Report Dashboard:
8s → 0.8s p95 Latency

The analytics dashboard was so slow users stopped using it. Reports took 8 seconds on a good day, timed out on a bad one. No indexes, full table scans, no caching layer.

0.8s p95 latency (was 3.2s)

-75% Query execution time

0 Timeouts post-deploy

Diagnosis

EXPLAIN ANALYZE revealed full table scans on a 14M-row orders table
N+1 queries in the report aggregation layer — hundreds of round trips per request
No read replicas — heavy analytical queries contending with OLTP writes
Dashboard hit the database fresh on every page load, no caching at any layer

-- BEFORE: full scan, no composite index
SELECT SUM(amount), DATE(created_at)
FROM orders
WHERE tenant_id = 42
  AND status = 'completed'
GROUP BY DATE(created_at);
-- Execution: 6,800ms. Rows examined: 14M

-- AFTER: composite index + date function fix
CREATE INDEX idx_orders_tenant_status_date
  ON orders (tenant_id, status, created_at);
-- Execution: 42ms. Rows examined: 8,200
        

The Three-Layer Fix

Layer 1 — Indexes: 8 composite indexes targeting the report query patterns. Immediate 85% query time reduction.
Layer 2 — Read Replica: Routed all SELECT queries from the dashboard to a read replica, eliminating write contention.
Layer 3 — Redis Cache: 5-minute TTL cache on pre-computed aggregates. Invalidated on write events. Dashboard loads from cache 94% of the time.

Caching Strategy

Used a cache-aside pattern with tenant-scoped cache keys. A background job pre-warms the most-accessed report combinations every 4 minutes, inside the 5-minute TTL, so hot entries are refreshed before they expire. Cache hit rate sits at 94%. A cold cache still serves in <800ms thanks to the index optimizations.
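A minimal Python sketch of cache-aside with tenant-scoped keys follows. A dict stands in for Redis, and the key format, class name, and injectable clock are assumptions made so the example runs on its own.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TenantReportCache:
    """Cache-aside with tenant-scoped keys; a dict stands in for Redis."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def key(tenant_id: int, report: str) -> str:
        return f"tenant:{tenant_id}:report:{report}"

    def get_or_compute(self, tenant_id: int, report: str,
                       compute: Callable[[], Any]) -> Any:
        k = self.key(tenant_id, report)
        now = self._clock()
        entry = self._store.get(k)
        if entry is not None and entry[0] > now:
            return entry[1]               # hit: skip the database
        value = compute()                 # miss: run the real query
        self._store[k] = (now + self._ttl, value)
        return value

    def invalidate_tenant(self, tenant_id: int) -> None:
        # Called on write events so one tenant's change never flushes another's cache.
        prefix = f"tenant:{tenant_id}:"
        for k in [k for k in self._store if k.startswith(prefix)]:
            del self._store[k]
```

A pre-warm job in this model is just a scheduled call to `get_or_compute` for the popular report combinations, run slightly more often than the TTL.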

p95 dropped from 3.2s to 0.8s. Zero timeouts since launch. Dashboard usage increased 340% in the following month because users actually started trusting the data.

Tech Stack

MySQL 8.0 · AWS RDS Read Replica · Redis 7 · Laravel Cache · Datadog APM

Platform · Traffic / 05

Edge Layer Hardening:
Viral Traffic Without Meltdown

A partner integration went viral on social — legitimate traffic spiked 22× in ten minutes. Naive rate limits kicked real users; APIs returned 503s; Redis ran hot. The fix wasn't “more servers” first — it was coherent degradation at the edge.

0 Hard outages during spike

-94% 503 error rate vs baseline panic

<120ms p95 at gateway (cached)

The Problem

Global rate limit was per-IP — mobile carriers NAT'd thousands of users behind one address
Circuit breakers were binary: either fully open or fully closed, with no graceful partial responses
Origin PHP workers saturated before autoscale caught up; queue depth became the real bottleneck
No request classification — marketing burst traffic hit the same pools as checkout webhooks

Architecture Shift

Introduced an AWS API Gateway + WAF front door with JWT-aware routes, tenant-scoped quotas, and separate usage plans for “burst” vs “critical path.” Heavy GET endpoints moved behind CloudFront with stale-while-revalidate; POST mutations stayed origin-only with token buckets per API key.
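For readers unfamiliar with token buckets, here is a minimal Python sketch of one bucket per API key. The capacity, refill rate, class names, and in-memory storage are all assumptions; a gateway deployment would typically keep this state in a shared store.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity          # start full so short bursts pass
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        refill = (now - self._updated) * self.refill_per_second
        self._tokens = min(self.capacity, self._tokens + refill)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerKeyBuckets:
    """One independent bucket per API key."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self._args = (capacity, refill_per_second, clock)
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str) -> bool:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = self._buckets[api_key] = TokenBucket(*self._args)
        return bucket.allow()
```

The bucket permits a burst up to `capacity` and then settles to the refill rate, which suits POST mutations that should tolerate brief spikes but not sustained floods.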

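// Adaptive breaker + fallback payload (Node middleware sketch)
// Option names match the opossum library, assumed here; callOrigin,
// cachedEtag, and lastGoodSnapshot are defined elsewhere in the middleware.
const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(callOrigin, {
  timeout: 2500,                 // ms before a call counts as failed
  errorThresholdPercentage: 45,  // open once 45% of calls fail
  resetTimeout: 8000,            // ms before a half-open trial request
  volumeThreshold: 30,           // minimum calls before the breaker can open
});
// While open, serve the last good snapshot instead of a hard error
breaker.fallback(() => ({
  status: 200,
  body: { degraded: true, etag: cachedEtag, data: lastGoodSnapshot },
}));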
        

What Moved the Needle

Sliding-window limits per tenant_id + API key — not just IP
Priority lanes: webhook ingestion always kept a reserved minimum concurrency
Breaker opened to JSON fallbacks for read-heavy dashboards — stale data beat hard errors
Dashboards in Datadog tied gateway 5xx to autoscale policies — alarms fired before users noticed
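The first item above, sliding-window limits keyed by tenant and API key, can be sketched in a few lines of Python. The class name, the in-memory deque storage, and the limit values are assumptions for illustration only.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each (tenant, key)."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, tenant_id: str, api_key: str) -> bool:
        now = self._clock()
        hits = self._hits[(tenant_id, api_key)]
        cutoff = now - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()               # drop requests that left the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Keying on `(tenant_id, api_key)` rather than the client IP means thousands of mobile users behind one carrier NAT address no longer share a single quota.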

The spike became a non-event for core flows: zero hard outages, error budget intact, and product kept shipping instead of firefighting.

Tech Stack

AWS API Gateway · CloudFront · WAF · Node.js · Redis · Datadog

Data · Real-time / 06

CDC Field Notes:
Cache Invalidation That Matches Reality

Redis caches were stamped with TTL guesswork, so users saw stale inventory for minutes after warehouse updates. Nightly ETL was too slow; polling MySQL from app servers drove replication lag up. We needed change streams, not cron jobs.

<2s Median visibility after DB commit

-87% Stale-read support tickets

1 Source of truth (still MySQL)

The Problem

Cache keys were coarse (“tenant catalog”) — any SKU touch invalidated huge graphs
Binlog-based hacks existed but weren't wired to domain events — ops feared silent drift
Kafka was already in play for billing — duplicate pipelines weren't acceptable

Pipeline Design

Debezium connector on MySQL binlog → Kafka topic inventory.cdc → thin consumers that translate row-level commits into granular Redis DEL / SET operations and optional websocket fan-out for live dashboards.

-- CDC topic payload (simplified; a real Debezium update event also carries the prior row in "before")
{
  "before": null,
  "after": { "sku_id": "SKU-991", "qty": 42, "warehouse": "LHR-01" },
  "op": "u",
  "ts_ms": 1735689600123
}
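A thin consumer for a payload like the one above might look like this Python sketch. The cache key format and the `apply_cdc_event` helper are hypothetical, a dict stands in for Redis, and the `op` codes follow Debezium's convention (`c` create, `u` update, `d` delete, `r` snapshot read).

```python
import json
from typing import Dict, Optional


def apply_cdc_event(cache: Dict[str, str], event: dict) -> Optional[str]:
    """Apply one row-level change to a per-SKU cache entry.

    Returns the cache key that was touched, or None if the event was ignored.
    """
    op = event.get("op")
    if op in ("c", "u", "r"):
        row = event.get("after")
    elif op == "d":
        row = event.get("before")    # deletes only carry the old row
    else:
        return None
    if row is None:
        return None

    key = f"inventory:{row['warehouse']}:{row['sku_id']}"
    if op == "d":
        cache.pop(key, None)         # the Redis DEL equivalent
    else:
        cache[key] = json.dumps(row, sort_keys=True)  # the Redis SET equivalent
    return key
```

Because each event touches exactly one SKU-level key, a single stock change no longer invalidates a whole tenant catalog.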
        

Safety Rails

Effectively-once processing at the consumer: at-least-once delivery, deduplicated on an idempotency key = binlog position + primary key
Poison messages sidelined to DLQ with replay UI (same patterns as Case 02)
Feature flag to fall back to TTL-only mode if connector lag exceeded SLO
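The deduplication rail can be sketched as follows in Python. The `IdempotentApplier` class is hypothetical, the binlog file and position are passed in explicitly rather than read from a specific payload field, and the in-memory set stands in for whatever durable store the real consumers use.

```python
from typing import Callable, Set, Tuple

DedupeKey = Tuple[str, int, str]  # (binlog file, binlog position, primary key)


class IdempotentApplier:
    """Apply each change at most once, even if the broker redelivers it."""

    def __init__(self, apply: Callable[[dict], None]):
        self._apply = apply
        self._seen: Set[DedupeKey] = set()

    def handle(self, binlog_file: str, binlog_pos: int,
               primary_key: str, event: dict) -> bool:
        key = (binlog_file, binlog_pos, primary_key)
        if key in self._seen:
            return False              # redelivery: already applied
        self._apply(event)
        self._seen.add(key)           # recorded only after a successful apply
        return True
```

Recording the key after the apply means a crash between the two steps leads to a repeat rather than a lost update, which is safe as long as the cache write itself is idempotent.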

Outcome

Support stopped arguing about “ghost stock.” Merchandising trusted the dashboard again because the cache finally tracked writes instead of hoping timeouts aligned with human patience.

Stale-read tickets dropped 87%; median propagation sat under 2 seconds with zero dual-write corruption.

Tech Stack

Debezium · MySQL binlog · Kafka · Redis · Laravel (consumers)

From “why is this on fire” to boring.
