Following two significant global outages in November and December 2025, Cloudflare has launched "Code Orange: Fail Small"—a company-wide initiative to prevent similar incidents and build a more resilient network. The plan prioritizes this work above all else, requiring cross-functional teams to address systemic vulnerabilities in how configuration changes are deployed.
What Went Wrong
Both outages followed a similar pattern: configuration changes were deployed instantaneously across hundreds of data centers worldwide, triggering cascade failures that took down significant portions of Cloudflare's network.
The November 18 incident involved an automatic update to Bot Management classifiers that ran for approximately two hours and ten minutes. The December 5 outage, triggered by a security tool configuration change to protect against a React vulnerability, affected 28% of applications for about 25 minutes.
The root cause? While Cloudflare applies rigorous controlled rollouts to software binary releases—with multiple gates, gradual deployment, and automatic rollback—these same safeguards weren't applied to configuration changes.
Three-Pillar Response Plan
Cloudflare's Code Orange work is organized into three main areas:
1. Controlled Configuration Rollouts
Configuration changes will now follow the same Health Mediated Deployment (HMD) process used for software releases. Instead of propagating globally in seconds via Quicksilver, changes will:
- Deploy first to employee traffic
- Roll out gradually to increasing customer percentages
- Monitor defined success metrics at each gate
- Automatically roll back if anomalies are detected
This represents a fundamental shift: configuration changes are treated with the same caution as code.
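The staged rollout described above can be sketched as a simple gate-and-rollback loop. This is a minimal illustration, not Cloudflare's actual HMD implementation: the stage names, traffic fractions, and the `apply_stage`/`health_ok`/`rollback` hooks are all assumptions for the sake of the example.

```python
# Minimal sketch of a health-gated staged rollout with automatic rollback.
# Stages, fractions, and hook functions are illustrative assumptions --
# not Cloudflare's actual Health Mediated Deployment system.

STAGES = [
    ("employee-traffic", 0.0),  # dogfood on internal traffic first
    ("canary", 0.01),           # 1% of customer traffic
    ("partial", 0.10),          # broader slice
    ("global", 1.00),           # full deployment
]

def staged_rollout(apply_stage, health_ok, rollback):
    """Advance stage by stage; on the first failed health gate,
    roll back everything applied so far and stop."""
    completed = []
    for name, fraction in STAGES:
        apply_stage(name, fraction)
        if not health_ok(name):
            rollback(completed + [name])
            return False, completed
        completed.append(name)
    return True, completed
```

A caller supplies the three hooks; a health check that fails at the canary stage, for example, causes the change to be withdrawn before it ever reaches global traffic.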
2. Improved Failure Mode Handling
Cloudflare is reviewing interface contracts between all critical products and services to ensure failures are contained rather than cascading. The Bot Management incident, for example, exposed two key interfaces that should have handled failures gracefully:
- The config file reader should have used validated defaults instead of panicking
- The Bot Management module failure shouldn't have dropped traffic—it should have allowed traffic to pass with basic classification
By assuming failure will occur at every interface and handling it appropriately, individual component failures won't trigger network-wide outages.
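The two interface fixes above follow one pattern: contain the failure at the boundary instead of letting it propagate. A hedged sketch of that pattern, with hypothetical names and defaults (none of this is Cloudflare's actual code):

```python
# Sketch of "contain, don't cascade" at two interfaces:
# 1) a config loader that falls back to validated defaults rather than
#    raising, and 2) a classifier wrapper that passes traffic with a basic
#    verdict when the full module fails. All names are hypothetical.

DEFAULT_BOT_CONFIG = {"max_features": 200, "threshold": 0.5}

def load_bot_config(parse):
    """Return a parsed, validated config, or safe defaults on any failure."""
    try:
        cfg = parse()
        if cfg.get("max_features", 0) <= 0:
            raise ValueError("invalid max_features")
        return cfg
    except Exception:
        # Fail open to known-good defaults instead of crashing the proxy.
        return dict(DEFAULT_BOT_CONFIG)

def classify(request, full_classifier):
    """Prefer the full classifier; on failure, let traffic pass with a
    degraded basic classification rather than dropping it."""
    try:
        return full_classifier(request)
    except Exception:
        return {"verdict": "pass", "score": None, "degraded": True}
```

The key design choice is that both functions define a safe degraded output ahead of time, so a failure at either interface changes classification quality, not availability.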
3. Emergency Response Improvements
Incident response was slowed by security systems that blocked the team's own access, and by circular dependencies. During the November incident, Turnstile (Cloudflare's CAPTCHA solution) became unavailable—but since Turnstile protects dashboard logins, customers without active sessions couldn't access Cloudflare to make critical changes.
Cloudflare is reviewing all "break glass" procedures to ensure:
- The right people have the right access during emergencies
- Circular dependencies are removed or bypassable
- Training exercises are frequent enough to ensure familiarity
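Loops like Turnstile protecting the dashboard login that operators need in order to fix Turnstile can be surfaced mechanically by auditing the service dependency graph for cycles. A small sketch, assuming a hand-maintained map of service dependencies (the graph and service names here are illustrative, not Cloudflare's):

```python
# Hypothetical sketch: find a circular dependency among internal services
# by depth-first search over an adjacency map of "service -> services it
# depends on". Returns one cycle as a list of names, or None.

def find_cycle(deps):
    state = {}   # service -> "visiting" (on stack) or "done"
    stack = []

    def visit(node):
        state[node] = "visiting"
        stack.append(node)
        for dep in deps.get(node, ()):
            if state.get(dep) == "visiting":
                # Back edge: everything from dep onward forms a cycle.
                return stack[stack.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = "done"
        return None

    for service in list(deps):
        if service not in state:
            cycle = visit(service)
            if cycle:
                return cycle
    return None
```

Running this over, say, `{"turnstile": ["dashboard-login"], "dashboard-login": ["turnstile"]}` immediately flags the login loop, so it can be broken or given a bypass path before an incident forces the issue.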
Timeline and Commitment
By the end of Q1 2026, Cloudflare aims to:
- Cover all production systems with Health Mediated Deployments for configuration
- Update systems to adhere to proper failure modes
- Establish processes for proper emergency remediation access
Some goals, like managing circular dependencies and updating break glass procedures, will be evergreen as the platform evolves.
Industry Implications
Cloudflare's transparency about these failures and their response plan offers valuable lessons for the infrastructure industry. The distinction between software deployment rigor and configuration change speed represents a common blind spot—one that becomes critical at scale.
The "Fail Small" philosophy acknowledges that mistakes will occur. The goal isn't perfection, but containment: ensuring that when failures happen, they affect minimal services for minimal time.
TL;DR
- Cloudflare experienced two major outages in late 2025 due to instantaneous global configuration deployments
- "Code Orange: Fail Small" initiative prioritizes building resilience above all other work
- Configuration changes will now follow the same controlled rollout process as software releases
- All service interfaces being reviewed to ensure failures don't cascade
- Emergency access procedures and circular dependencies being addressed
- Q1 2026 target for completing core resilience improvements