The Challenge: Database at Global Scale

PostgreSQL powers core products like ChatGPT and OpenAI's API. After ChatGPT's launch, traffic grew at an unprecedented rate. OpenAI's PostgreSQL load grew by more than 10x over the past year.

OpenAI revealed a critical insight: PostgreSQL can reliably support much larger read-heavy workloads than previously thought possible. Today, a single primary Azure PostgreSQL instance with nearly 50 read replicas across regions handles millions of queries per second for 800 million users.

Architecture Challenges

Key problems encountered:
- Cache misses causing sudden database load spikes
- Expensive multi-way joins saturating CPU
- Write storms from feature launches
- MVCC write amplification

Optimization Strategies

Write Load Reduction

Migrate write-heavy workloads to Azure Cosmos DB. Optimize applications to minimize writes through lazy writes and strict rate limits.

Query Optimization

Avoid complex multi-table joins. Use PgBouncer for connection pooling, reducing latency from 50ms to 5ms.

High Availability & Replication

Run primary in HA mode with hot standby. Deploy 50 read replicas across regions. Test cascading replication for 100+ replicas.

Load Protection

Implement cache locking, multi-layer rate-limiting, and query-level blocking.

Results

- Millions of queries per second
- ~10ms p99 latency
- 99.999% availability
- Only 1 SEV-0 incident in 12 months

TL;DR

- Single-Primary: Works for read-heavy with 50 replicas across regions
- Write Optimization: Cosmos DB, lazy writes, strict rate limits
- HA Setup: Hot standby, multi-region replicas, read-replica failover
- Connection Pooling: PgBouncer reduces 50ms to 5ms
- Load Protection: Cache locking, rate-limiting, query blocking
- Performance: Millions QPS, 10ms p99, 99.999% availability

Source: OpenAI: Scaling PostgreSQL