Notes on distributed systems, infrastructure, and the craft of software engineering.

Scaling Postgres beyond a single primary

After hitting 50k QPS, our database started showing strain. Here's how we approached read replicas and connection pooling, and eventually moved to a partitioned setup.

The hidden cost of event-driven architecture

Event-driven systems promise loose coupling, but the operational complexity is real. Debugging cascading failures across 30 services taught us when synchronous calls are actually fine.

Why we capped our observability budget

Datadog bill creeping toward six figures? You're not alone. We cut spend by 60% without losing visibility.

Anatomy of a 4-hour outage

A misconfigured Kafka consumer triggered cascading failures across our entire payment pipeline. Here's the full timeline and the mistakes we made.