Cloudflare operates data centers in over 330 cities globally, but managing disruptive maintenance across this massive infrastructure requires careful coordination. Manual oversight alone can't guarantee that routine hardware updates won't inadvertently conflict with critical customer paths.
The Challenge: Coordinating Global Maintenance
As Cloudflare's network grew, tracking overlapping maintenance requests and customer-specific routing rules through manual coordination became nearly impossible. The company needed a centralized, automated system that could enforce safety constraints across the entire network simultaneously.
For example, edge routers that connect the public Internet to multiple data centers must never all go offline at once. Similarly, customers using Dedicated CDN Egress IPs (formerly "Aegis") choose specific data centers for low-latency traffic; if all chosen locations go offline simultaneously, those customers experience higher latency or errors.
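The Aegis constraint boils down to a simple invariant: for every customer pool, at least one chosen data center must stay online. A minimal sketch of such a check, with illustrative types and names (not Cloudflare's actual API):

```typescript
// Illustrative safety check: maintenance on a set of data centers is only
// safe if every affected Aegis egress pool keeps at least one member online.

type DataCenter = string;

interface AegisPool {
  customer: string;
  dataCenters: DataCenter[]; // locations the customer chose for egress
}

function maintenanceIsSafe(
  proposedOffline: Set<DataCenter>,
  pools: AegisPool[],
): boolean {
  // Every pool must retain at least one data center not under maintenance.
  return pools.every((pool) =>
    pool.dataCenters.some((dc) => !proposedOffline.has(dc)),
  );
}

// Example: taking FRA and MUC offline together would strand this customer.
const pools: AegisPool[] = [{ customer: "acme", dataCenters: ["FRA", "MUC"] }];
console.log(maintenanceIsSafe(new Set(["FRA"]), pools)); // true: MUC stays up
console.log(maintenanceIsSafe(new Set(["FRA", "MUC"]), pools)); // false
```

The same shape of check generalizes to the edge-router rule: replace the pool with the set of routers serving a group of data centers.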
Graph Processing Architecture
Cloudflare's solution leverages a graph-based interface inspired by Facebook's TAO research. The system models infrastructure as objects (vertices) and associations (edges), with typed relationships like "DATACENTER_INSIDE_AEGIS_POOL" that enable targeted queries.
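In the TAO model, a query asks for one association type from one vertex at a time, which is what keeps reads small and targeted. A minimal sketch of that interface (the class and method names here are illustrative, not Cloudflare's actual schema):

```typescript
// TAO-style graph sketch: objects are typed vertices, associations are
// typed, directed edges, and queries fetch one association type at a time.

type ObjectId = string;

interface Association {
  type: string; // e.g. "DATACENTER_INSIDE_AEGIS_POOL"
  from: ObjectId;
  to: ObjectId;
}

class Graph {
  private edges: Association[] = [];

  addAssoc(type: string, from: ObjectId, to: ObjectId): void {
    this.edges.push({ type, from, to });
  }

  // assocGet returns only edges of one type from one vertex, so a query
  // touches a small neighborhood instead of the whole graph.
  assocGet(from: ObjectId, type: string): ObjectId[] {
    return this.edges
      .filter((e) => e.from === from && e.type === type)
      .map((e) => e.to);
  }
}

const g = new Graph();
g.addAssoc("DATACENTER_INSIDE_AEGIS_POOL", "FRA", "pool:acme");
g.addAssoc("DATACENTER_INSIDE_AEGIS_POOL", "MUC", "pool:acme");
console.log(g.assocGet("FRA", "DATACENTER_INSIDE_AEGIS_POOL")); // [ 'pool:acme' ]
```

Because each call names both a vertex and an edge type, the backend can answer it with an index lookup rather than a scan of the full dataset.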
This approach cut API response sizes by roughly 100x overnight, because each query fetches only the relevant data instead of loading entire datasets into memory. For instance, when scheduling maintenance in Frankfurt, Germany, the system loads data only for neighboring German data centers rather than the entire global network.
Smart Fetch Pipeline
To handle the increased number of smaller requests without breaching subrequest limits, Cloudflare built a middleware pipeline with:
- Request deduplication: Multiple identical requests wait on the same Promise
- LRU caching: Recently seen requests are cached in memory
- CDN caching: GET requests are cached regionally with appropriate TTLs
- Exponential backoff: Retries with jitter reduce wasted fetch calls
The result: a ~99% cache hit rate, meaning nearly all HTTP requests are served from fast cache memory rather than from slower origin servers.
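The first two stages of the pipeline, deduplication and in-memory LRU caching, can be sketched as a wrapper around a fetch function. This is an illustrative reconstruction under stated assumptions, not Cloudflare's actual middleware; the real pipeline also layers on CDN caching and retries with jittered exponential backoff:

```typescript
// Sketch of two pipeline stages: identical concurrent requests share one
// Promise (deduplication), and completed responses land in a small LRU cache.

type Fetcher = (url: string) => Promise<string>;

function withDedupAndCache(fetcher: Fetcher, maxEntries = 100): Fetcher {
  const inFlight = new Map<string, Promise<string>>(); // dedup identical requests
  const cache = new Map<string, string>(); // Map insertion order doubles as LRU order

  return async (url: string) => {
    const cached = cache.get(url);
    if (cached !== undefined) {
      // Refresh recency by re-inserting the entry at the end of the Map.
      cache.delete(url);
      cache.set(url, cached);
      return cached;
    }
    const pending = inFlight.get(url);
    if (pending) return pending; // later callers await the same Promise

    const p = fetcher(url)
      .then((body) => {
        cache.set(url, body);
        if (cache.size > maxEntries) {
          // Evict the least recently used entry (the first key in the Map).
          cache.delete(cache.keys().next().value!);
        }
        return body;
      })
      .finally(() => inFlight.delete(url));
    inFlight.set(url, p);
    return p;
  };
}
```

With this wrapper, two concurrent calls for the same URL trigger a single upstream fetch, and any subsequent call is answered from memory, which is exactly how a high cache hit rate keeps the scheduler under Workers subrequest limits.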
Real-Time Metrics with Thanos
The scheduler analyzes router health in real time using Cloudflare's distributed Prometheus query engine, Thanos. By using the graph interface to find targeted relationships, query sizes dropped from multiple megabytes to approximately 1 KB (a roughly 1000x reduction), while deserialization work was offloaded to the Thanos servers.
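The size reduction comes from scoping the PromQL query to only the routers the graph identified, rather than pulling metrics for the whole network. A hedged sketch of that idea follows; the metric name, labels, and endpoint are assumptions for illustration, not Cloudflare's real telemetry:

```typescript
// Build a PromQL query restricted to a graph-selected set of routers.
// A regex label matcher limits the result to just those series.

function routerHealthQuery(routers: string[]): string {
  const matcher = routers.join("|");
  return `up{job="edge_router",instance=~"${matcher}"}`;
}

// The scheduler would send a query like this to Thanos's
// Prometheus-compatible HTTP API (e.g. /api/v1/query) and get back a
// response covering only those few series, instead of megabytes of metrics.
console.log(routerHealthQuery(["fra01", "fra02"]));
// up{job="edge_router",instance=~"fra01|fra02"}
```

Narrowing the query this way also shifts the filtering and aggregation work to the Thanos servers, so the Worker never deserializes data it is going to discard.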
Impact and Results
The Workers-based maintenance scheduler now acts as a safeguard that can see the entire network state at once, programmatically enforcing safety constraints. It prevents simultaneous maintenance on related infrastructure, protects customer-specific routing requirements, and enables faster infrastructure operations without sacrificing reliability.
TL;DR
- Cloudflare built an automated maintenance scheduler on Workers to manage infrastructure across 330+ cities
- Graph-based architecture reduced API response sizes by 100x and query sizes by 1000x
- Smart fetch pipeline with caching achieved ~99% cache hit rate
- System prevents conflicts like simultaneous router outages or customer egress path disruptions
- Demonstrates how Workers can handle complex, mission-critical internal operations at global scale
Source: Cloudflare Blog: Building Our Maintenance Scheduler on Workers