Failure Is the Default: Building Web Systems That Expect to Break

Last year I got paged for a production incident that had nothing to do with my code. Every line I had written was correct. Every test passed. Every deployment had gone through the full pipeline without a single warning. The system was down anyway, because a third-party authentication provider had a bad configuration push, and my application treated their availability as a given rather than a possibility. The code was fine. The architecture was the bug.

This is the experience that changed how I think about building for the web. Not the outage itself (outages happen), but the realization that I had written an entire system around an assumption I had never consciously made: that the things my code depends on will be there when it calls them. That assumption fails so often that availability should be treated as the exception, not the rule. Failure is the default state of the web. The moments when everything works are the anomaly.

The Web Is Held Together with Hope

In November 2025, a configuration error in Cloudflare’s bot management system caused proxy failures that disrupted roughly 20% of global internet traffic. X, ChatGPT, Spotify, banking portals, public transportation systems — all down or degraded, not because of anything those services did, but because they shared an infrastructure dependency that had a bad day. The detail that stuck with me: even the outage-tracking services went down, because they, too, ran on Cloudflare. The tool you reach for to understand the failure was itself a casualty of the failure.

This was not an isolated event. A few months earlier, a Google IAM update broke authentication for third-party apps using Google sign-in, which cascaded into Cloudflare services, which cascaded into downstream applications that were not Google customers and had no direct relationship with Google at all. A three-tier cascade where the blast radius extended far beyond anyone who had made a conscious decision to depend on the failing system. In October, an Azure Front Door configuration change took out Azure SQL, Virtual Desktop, Microsoft 365, and multiple security products — services that were individually healthy but shared a CDN layer that became the single point of failure.

Each of these outages followed the same pattern: a single configuration change at an infrastructure provider cascading into failures across thousands of unrelated services. The providers involved are not incompetent. They are the best in the industry. The problem is structural. When individually rational decisions — everyone picks the most reliable CDN, the most popular auth provider, the best-in-class DNS resolver — converge on the same handful of providers, the result is a collectively fragile system. A monoculture, in the ecological sense. Efficient under normal conditions, catastrophically vulnerable to a single pathogen.

A recent survey found that 95% of enterprise executives acknowledge structural weaknesses in their infrastructure dependencies. Fewer than one-third perform regular failover testing. We know the floor is rotten. We just keep walking on it.

The Availability Assumption

Most code is written with an implicit contract: when I call this function, it will return a result. This contract holds perfectly for local function calls. It does not hold for anything that crosses a network boundary, and most modern applications are networks of network calls.

An API request looks syntactically similar to a local function call. In many frameworks, the abstraction is so clean that the network boundary is nearly invisible. You call a method, you get a response, you continue. But a local function call and an API request are fundamentally different operations. The local call will succeed or throw a known exception in microseconds. The API call can succeed, fail, time out, return garbage, return a cached stale response, succeed on the server but fail on the network during the response, or hang indefinitely without ever resolving. The failure modes are not just more numerous — they are qualitatively different, because they involve systems you do not control and cannot inspect.
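Writing the outcomes down makes the difference concrete. What follows is a minimal sketch in Python, not a complete taxonomy. The names `Outcome` and `classify`, and the expectation that a healthy response is a JSON object with an "items" field, are assumptions made up for illustration.

```python
import json
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"
    TIMED_OUT = "timed out"
    HTTP_ERROR = "http error"
    GARBAGE = "unparseable or unexpected body"


def classify(status: Optional[int], body: Optional[bytes]) -> Outcome:
    """Map the result of one remote call onto an explicit outcome.

    `status` is None when no response arrived before the deadline.
    """
    if status is None:
        return Outcome.TIMED_OUT
    if status >= 400:
        return Outcome.HTTP_ERROR
    try:
        payload = json.loads(body or b"")
    except ValueError:
        # Covers invalid JSON and bodies that are not valid text at all.
        return Outcome.GARBAGE
    # A 200 with the wrong shape is still a failure for the caller.
    if not isinstance(payload, dict) or "items" not in payload:
        return Outcome.GARBAGE
    return Outcome.OK
```

A local function call needs none of this. The point of the sketch is only that every branch here is a decision someone has to make, and the clean abstraction hides all of them.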

L. Peter Deutsch is credited with formulating the fallacies of distributed computing at Sun Microsystems in 1994. The first fallacy: the network is reliable. That was over thirty years ago, and we are still building production systems that implicitly assume it. This is not a knowledge gap. Every senior developer knows the network is unreliable. The problem is that our tools, frameworks, and habits make it easy to write code that ignores this knowledge. Many HTTP clients ship with no timeout at all, so by default they wait indefinitely for a response. The default error handling is to crash. The default architecture is to treat every dependency as a hard requirement. Resilient behavior requires conscious, deliberate effort at every integration point. The path of least resistance produces fragile systems.

Fast Failure as a Design Principle

When a dependency goes down, the worst thing your system can do is wait. A request that hangs for 30 seconds while waiting for an unresponsive service is actively harmful in ways that a clean 503 in 5 milliseconds is not. The slow failure consumes a thread, a connection, memory. It backs up the request queue. Other requests that have nothing to do with the failing dependency start timing out because the system has no resources left. A single broken integration, given enough time, becomes a full system outage.

This is why the first question I ask about any external dependency is not “what does this return?” but “what happens when this doesn’t come back?” If the answer is “I don’t know” or “it waits,” there is a latent outage embedded in the architecture. It will surface at the worst possible time, because that is when dependencies fail — under load, during peak traffic, on the day your on-call engineer is at a conference.
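One way to give that question an explicit answer is to put a deadline and a fallback on the call itself. Here is a minimal sketch using Python's standard library. The function name `fetch_or_fallback` and its parameters are hypothetical, and the right timeout value depends entirely on the dependency.

```python
import urllib.request


def fetch_or_fallback(url: str, timeout_s: float, fallback: bytes) -> bytes:
    """Return the response body, or `fallback` if the call fails or stalls.

    Without an explicit `timeout`, urlopen uses the global socket default,
    which is "no timeout" unless the program has changed it.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # URLError, connection errors, and socket timeouts all derive from OSError.
        return fallback
```

One caveat: the timeout here bounds each blocking socket operation, not the request as a whole, so a server that trickles bytes slowly can still take longer than `timeout_s`. A hard end-to-end deadline needs an extra layer on top.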

Circuit breakers exist to formalize this thinking. The mechanism is simple: track failures, and when they exceed a threshold, stop sending requests entirely. Return a fallback or an error immediately instead of waiting for a timeout you already know is coming. After a cooling period, let a single request through to test whether the dependency has recovered. The concept fits in a few lines of pseudocode:

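if recent_failures > threshold and not cooldown_elapsed:
    return fallback_response     # circuit open: fail fast, don't pile on
# circuit closed, or cooldown over: let this request through as a probe
if dependency_call times out:
    record_failure()             # also restarts the cooldown clock
    return fallback_response
else:
    reset_failure_count()        # success closes the circuit again
    return response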

But the value of a circuit breaker is not in the code. It is in the design decision it represents: the explicit acknowledgment that this dependency will fail, and when it does, the system will do something other than wait and hope.
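For readers who want something executable, here is a minimal Python sketch of the same idea, with the cooling period handled explicitly. It is an illustration, not a production implementation. The class name, the default threshold and cooldown, and the injectable `clock` are all assumptions, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cooldown."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback  # open: skip the dependency entirely
            # Cooldown over: fall through and let this call act as the probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                # Open the circuit, or re-open it after a failed probe.
                self.opened_at = self.clock()
            return fallback
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Usage would look something like `breaker.call(lambda: load_profile(user_id), fallback=None)`, where `load_profile` stands in for whatever dependency is being wrapped. Passing the clock in as a parameter makes the cooldown testable without sleeping.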

Not All Failures Are Equal

Once you accept that dependencies will fail, the next question is which failures matter. Not all of them do — at least, not equally.

An analytics service going down should not prevent a user from completing a purchase. A recommendation engine returning errors should not produce a blank page. A third-party chat widget failing to load should not block the rendering of your core content. Yet I have seen all of these happen in production, because the systems treated every dependency as equally critical. The analytics script threw an unhandled exception that bubbled up and killed the page. The recommendation API timed out, and the frontend waited for it before rendering anything. The chat widget loaded synchronously in the head of the document and blocked everything below it.

The fix is not technical complexity. It is classification. Every external dependency gets categorized: critical or non-critical. Critical failures mean the operation cannot complete — show an error, explain why, and make it possible to retry. Non-critical failures mean the operation completes with reduced functionality — hide the broken component, serve cached data, or simply omit the feature. The user may never notice.
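The classification can live directly in code. Here is a rough sketch in Python, under the assumption that each dependency is a zero-argument callable. `Dependency` and `render_page` are hypothetical names, not part of any framework.

```python
from typing import Callable, Dict, NamedTuple


class Dependency(NamedTuple):
    fetch: Callable[[], object]
    critical: bool
    fallback: object = None  # used only when a non-critical call fails


def render_page(deps: Dict[str, Dependency]) -> Dict[str, object]:
    """Collect data for a page; only critical failures abort the request."""
    data: Dict[str, object] = {}
    for name, dep in deps.items():
        try:
            data[name] = dep.fetch()
        except Exception as exc:
            if dep.critical:
                # Surface a clear, retryable error to the caller.
                raise RuntimeError(f"{name} is unavailable, please retry") from exc
            # Degrade: substitute cached or empty data and carry on.
            data[name] = dep.fallback
    return data
```

Because `critical` is a required field, nobody can add a dependency without deciding which kind it is. That forced decision is most of the benefit.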

This site is a small example of that philosophy in practice. It runs a service worker that uses cache-first for static assets and network-first for page content, with an offline fallback page as the last resort. If the CDN has a bad day, the CSS and JavaScript still load from cache. If the network goes down entirely, the user gets a meaningful offline page instead of a browser error. It is not complex engineering. It is a series of small decisions, each one answering the question: what should happen when this specific thing fails?
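The two strategies are easy to model outside the browser. The sketch below is a Python model of the decision logic only, not this site's service worker code, which would be JavaScript. The dictionary cache, the `fetch` callable, and the `OFFLINE_PAGE` placeholder are assumptions for illustration.

```python
from typing import Callable, Dict

OFFLINE_PAGE = "<h1>You are offline</h1>"  # placeholder offline fallback


def cache_first(url: str, cache: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Static assets: serve the cached copy, touch the network only on a miss."""
    if url in cache:
        return cache[url]
    cache[url] = fetch(url)
    return cache[url]


def network_first(url: str, cache: Dict[str, str], fetch: Callable[[str], str]) -> str:
    """Pages: prefer fresh content, then the cache, then the offline page."""
    try:
        cache[url] = fetch(url)
        return cache[url]
    except OSError:
        # Network failure: fall back to the last good copy, if there is one.
        return cache.get(url, OFFLINE_PAGE)
```

Each function encodes one answer to the question of what should happen when a specific thing fails. A stale stylesheet is acceptable, so it comes from the cache. A stale article is less acceptable, so the cache is only the second choice.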

Test the Failure, Not Just the Feature

The patterns for building resilient systems — circuit breakers, timeouts, graceful degradation, fallback chains — are well-documented and have been for years. The gap is not in knowledge or tooling. The gap is in practice.

Most teams test the happy path exhaustively and the failure path never. The test suite verifies that the system works when every dependency is available, every response is well-formed, and every network call returns in milliseconds. This is the least interesting scenario, because it is the scenario where you do not need resilience. The scenario you need to test — the one that will actually determine whether your system survives production — is what happens when things go wrong.

If you have never killed a dependency in a staging environment and watched what your system does, you do not know how it handles failure. You have a theory. You might have written circuit breakers and fallbacks and timeout handling, and they might all work exactly as designed. But you have not verified it, and unverified assumptions in distributed systems have a way of surfacing as 3 AM pages.

Chaos engineering formalizes this into a practice: inject failures deliberately, observe the behavior, and fix the gaps before production finds them for you. But you do not need a full chaos engineering platform to start. You can block a downstream service at the network level and see what happens. You can set a dependency’s response time to 30 seconds and see whether your timeouts actually work. You can unplug the cache and see if the fallback serves stale data or crashes.
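The slow-dependency experiment needs nothing more than a socket. Here is a small Python sketch that stands up a local server which accepts connections and never answers, then measures how long a client takes to give up. `hanging_server`, `call_dependency`, and `probe_hanging_dependency` are hypothetical names, and `call_dependency` stands in for whatever client code is actually under test.

```python
import socket
import time
import urllib.request
from contextlib import contextmanager


@contextmanager
def hanging_server():
    """Listen on a local port; connections are accepted by the OS but never answered."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(5)
    try:
        yield f"http://127.0.0.1:{srv.getsockname()[1]}/"
    finally:
        srv.close()


def call_dependency(url: str, timeout_s: float):
    """Stand-in for the client under test: body on success, None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        return None


def probe_hanging_dependency(timeout_s: float):
    """Return (result, seconds elapsed) for one call against a server that hangs."""
    with hanging_server() as url:
        start = time.monotonic()
        result = call_dependency(url, timeout_s)
        return result, time.monotonic() - start
```

If the elapsed time comes back close to `timeout_s`, the timeout works. If the call never returns, the experiment has found the latent outage before production did.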

The systems that survive production are not the ones with the most sophisticated architectures. They are the ones built by people who expected them to break — and then verified that expectation before their users did.