The Challenge: Database at Global Scale
PostgreSQL powers core products like ChatGPT and OpenAI's API. After ChatGPT's launch, traffic grew at an unprecedented rate. OpenAI's PostgreSQL load grew by more than 10x over the past year.
OpenAI revealed a critical insight: PostgreSQL can reliably support much larger read-heavy workloads than previously thought possible. Today, a single primary Azure PostgreSQL instance with nearly 50 read replicas across regions handles millions of queries per second for 800 million users.
Architecture Challenges
Key problems encountered:
- Cache misses causing sudden database load spikes
- Expensive multi-way joins saturating CPU
- Write storms from feature launches
- MVCC write amplification
Optimization Strategies
Write Load Reduction
Migrate write-heavy workloads to Azure Cosmos DB. Optimize applications to minimize writes through lazy writes and strict rate limits.
Query Optimization
Avoid complex multi-table joins. Use PgBouncer for connection pooling, reducing latency from 50ms to 5ms.
High Availability & Replication
Run primary in HA mode with hot standby. Deploy 50 read replicas across regions. Test cascading replication for 100+ replicas.
Load Protection
Implement cache locking, multi-layer rate-limiting, and query-level blocking.
Results
- Millions of queries per second
- ~10ms p99 latency
- 99.999% availability
- Only 1 SEV-0 incident in 12 months
TL;DR
- Single-Primary: Works for read-heavy with 50 replicas across regions
- Write Optimization: Cosmos DB, lazy writes, strict rate limits
- HA Setup: Hot standby, multi-region replicas, read-replica failover
- Connection Pooling: PgBouncer reduces 50ms to 5ms
- Load Protection: Cache locking, rate-limiting, query blocking
- Performance: Millions QPS, 10ms p99, 99.999% availability
Source: OpenAI: Scaling PostgreSQL