52 Minutes a Year: What 99.99% Availability Actually Costs


At 2:47 AM on a Tuesday, our on-call engineer got paged. Match results weren't being recorded. Players were finishing games, seeing a success screen, and then finding their stats unchanged.
The service was up. Health checks passing. No 500s. Error rate: 0%.
The database write was silently swallowing failures because someone had wrapped it in a catch block that logged and returned success. The service was available. It was not reliable.
That incident cost us three hours of corrupted match data and a week of player trust. It showed up nowhere in our uptime dashboard.
This is the gap between availability and reliability — and it's where most systems silently fail.
99.99% availability means 52 minutes of downtime per year.
99.9% means 8.7 hours.
99% means 3.65 days.
Most teams aim for 99.9% and call it high availability. At 20,000 CCU during a live tournament, 8.7 hours of downtime means you have no business running a real-time platform. The math alone should set your target.
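The arithmetic behind those numbers is quick to verify. A minimal sketch, assuming a 365-day year:

```python
# Downtime budget per availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability):
    """Minutes of allowed downtime per year at a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%}: {downtime_minutes(target):,.1f} min/year")
```
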
But here's the harder truth: hitting 99.99% isn't primarily an infrastructure problem. It's an engineering discipline problem. You can run on the best infrastructure in the world and still be unreliable because your code silently eats errors, your timeouts are wrong, or your deployment process has no rollback strategy.
Availability and reliability are not the same thing, and your monitoring probably only tracks one of them.
Availability asks: is the service responding? Reliability asks: is the service returning correct answers?
A service can be 100% available and 0% reliable — if it responds to every request with wrong data. The silent failure I described at 2:47 AM is the canonical example.
Reliability requires all of it: the service responds, the responses are correct, writes actually persist, and failures are surfaced instead of swallowed.
Most teams measure the first. Almost none systematically validate the other three.
Getting to 99.99% means paying three taxes up front. There are no shortcuts.
When a dependency fails, your service should degrade — not die.
This sounds obvious until you have to implement it. Graceful degradation means deciding, for every external call your service makes: what do we return if this call fails or times out?
For our tournament platform, we answered that question for every dependency, one by one. The pattern that emerged: every external dependency gets a fallback, a timeout, and a circuit breaker.
No fallback + slow dependency = your service hangs waiting for something that will never respond, exhausts its thread pool, and takes down everything upstream.
We use a 200ms timeout on all internal gRPC calls. Anything slower than 200ms is a degraded dependency, not a slow one — and we treat it accordingly.
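A minimal sketch of the fallback-plus-timeout half of that pattern (the circuit breaker is omitted; `fetch_leaderboard` and the cached fallback are hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

_pool = ThreadPoolExecutor(max_workers=8)

def call_with_fallback(fn, timeout_s, fallback):
    """Run an outbound call with a hard deadline; degrade instead of hanging."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except CallTimeout:
        future.cancel()  # best effort: the worker thread may still be running
        return fallback
    except Exception:
        return fallback  # a failed dependency degrades too; it never propagates a hang

# Hypothetical usage: serve a cached leaderboard if the live call blows the 200ms budget.
# leaderboard = call_with_fallback(fetch_leaderboard, 0.2, cached_leaderboard)
```
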
When something breaks, how much breaks with it?
The answer should always be: as little as possible. This is bulkhead design — the same principle that keeps a ship floating when one compartment floods.
In practice, this means isolating components so they fail independently: separate connection pools, separate deployments, separate failure domains.
The test: can you deploy, restart, or degrade any single component without cascading to others? If not, you have an uncontrolled blast radius.
We learned this the hard way when a slow analytics query on a shared connection pool started timing out match writes. They were completely unrelated features. They shared a pool. One suffered, both died.
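The fix generalizes: give each workload its own capacity so a slow one fails fast instead of draining a shared pool. A sketch of a semaphore-based bulkhead, with illustrative pool sizes:

```python
import threading

class Bulkhead:
    """Cap concurrent calls per workload; reject instead of queueing when full."""
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        if not self._slots.acquire(blocking=False):
            # Failing fast here is the point: a full bulkhead must not
            # stall callers the way a shared, exhausted pool does.
            raise RuntimeError("bulkhead full: shedding load")
        try:
            return fn()
        finally:
            self._slots.release()

# Match writes and analytics each get their own compartment.
match_writes = Bulkhead(max_concurrent=20)
analytics = Bulkhead(max_concurrent=5)
```
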
MTTR (Mean Time to Recovery) matters more than MTTF (Mean Time to Failure). Things will break. The question is how fast you recover.
Fast recovery requires three things:
Automated rollback: Every deployment needs a one-command rollback. Ours is argocd app rollback gakbytes-api 1 — one command, 90-second rollback to the previous version. If you're manually reverting during an incident, your MTTR is 10x longer than it needs to be.
Readiness probes that actually test readiness: The default Kubernetes readiness probe checks if the HTTP server is running. That tells you nothing about whether the service is actually ready to handle traffic. Our readiness probe hits /health/ready which verifies: database connection pool healthy, Redis reachable, message queue consumer running. All three must pass before we route traffic.
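The shape of that handler, sketched with stub checks standing in for the real client calls:

```python
def ready_response(checks):
    """Return (status, detail) for /health/ready: 200 only if every check passes."""
    return (200 if all(checks.values()) else 503), checks

# Stubbed checks for illustration; wire each one to the real client.
def check_db_pool():         # e.g. run SELECT 1 on a pooled connection
    return True

def check_redis():           # e.g. redis_client.ping()
    return True

def check_queue_consumer():  # e.g. consumer heartbeat seen within the last 30s
    return True

def health_ready():
    return ready_response({
        "db_pool": check_db_pool(),
        "redis": check_redis(),
        "queue_consumer": check_queue_consumer(),
    })
```
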
Runbooks for the top 5 failure modes: Not wiki pages nobody reads — actual step-by-step commands in the alert itself. When your on-call engineer is paged at 2:47 AM, they should be running commands within 3 minutes, not searching Confluence.
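One way to put the runbook in the alert itself: a hypothetical Prometheus rule (the deployment name is an assumption) that reuses the write-ratio query and the rollback command from elsewhere in this post:

```yaml
groups:
- name: match-service
  rules:
  - alert: MatchWriteRatioLow
    expr: rate(db_writes_total[5m]) / rate(http_requests_total{handler="match_result",status="200"}[5m]) < 0.95
    for: 5m
    annotations:
      summary: "Match writes lagging successful responses: possible silent failure"
      runbook: |
        1. kubectl logs deploy/match-service --since=10m | grep -i error
        2. Check DB connection pool saturation on the match-service dashboard
        3. Deployed in the last hour? argocd app rollback gakbytes-api 1
```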
Five failure modes cause most reliability loss, and they are the ones that don't show up in your uptime dashboard.
Silent failures (the 2:47 AM incident): A try/catch that swallows errors and returns success. Use structured error logging and always propagate errors to the caller. Never return success when you've caught an exception.
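The difference in code, with a stand-in `db` client. The only meaningful change is the last line of the catch:

```python
import logging

log = logging.getLogger("match")

# BAD: the shape of our 2:47 AM bug. Catch, log, lie to the caller.
def record_match_bad(db, result):
    try:
        db.write(result)
    except Exception as exc:
        log.error("match write failed: %s", exc)
        return {"ok": True}   # caller sees success; the data is gone
    return {"ok": True}

# GOOD: log with context, then re-raise so the failure reaches the caller
# (and your error-rate metrics).
def record_match(db, result):
    try:
        db.write(result)
    except Exception:
        log.exception("match write failed match_id=%s", result.get("id"))
        raise
    return {"ok": True}
```
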
Timeout misconfiguration: Services with no timeouts will eventually hang forever when a dependency slows down. Every outbound call needs an explicit timeout. Every inbound endpoint needs a request timeout. Default framework timeouts are almost always wrong.
Thundering herd on startup: When a service restarts, if it immediately receives full traffic before its caches are warm, it slams the database with cold reads. Add a 30-second startup delay to your readiness probe and pre-warm caches in your initialization sequence.
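In Kubernetes terms, the startup delay is a single probe field. A sketch, with path and port assumed:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 30   # cache warm-up window before any traffic is routed
  periodSeconds: 5
  failureThreshold: 3
```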
Deployment-induced failures: A deployment that replaces all pods simultaneously causes a service gap. Use rolling deployments with maxUnavailable: 0 — always bring up new pods before terminating old ones. Combined with PodDisruptionBudgets, this makes deployments invisible to users.
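The corresponding Deployment fragment, as a sketch:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never terminate an old pod before its replacement is Ready
    maxSurge: 1
```

A PodDisruptionBudget with minAvailable set extends the same guarantee to node drains and other voluntary evictions.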
Retry amplification: Client retries + slow dependency = traffic amplification. If 10,000 clients each retry 3 times, a struggling service gets 30,000 requests instead of 10,000 — the exact opposite of what helps recovery. Always use exponential backoff with jitter. Never retry immediately.
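A sketch of the "full jitter" variant, where each retry sleeps a random amount up to an exponentially growing cap (parameters are illustrative):

```python
import random
import time

def retry_with_jitter(fn, attempts=5, base=0.1, cap=5.0):
    """Retry fn on failure with full-jitter exponential backoff.

    Never retries immediately; re-raises after the final attempt so the
    failure stays visible to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: uniform sleep in [0, min(cap, base * 2^attempt)],
            # which spreads retries out instead of synchronizing the herd.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```
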
Availability is easy to measure. Reliability is harder. These are the metrics that tell you the truth:
# Error rate over the last hour; divide by (1 - SLO) to get your error-budget burn rate
1 - (
sum(rate(http_requests_total{status!~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
# Silent failure detector — requests that succeeded but wrote nothing
rate(db_writes_total[5m]) / rate(http_requests_total{handler="match_result",status="200"}[5m])
# Dependency health — what percentage of calls to each dependency succeed?
sum(rate(grpc_client_handled_total{code="OK"}[5m])) by (grpc_service)
/
sum(rate(grpc_client_handled_total[5m])) by (grpc_service)
# Deployment impact — did this deploy change error rates?
sum(rate(http_requests_total{status=~"5.."}[5m])) by (version)

That second query, writes per successful request, is the one that would have caught our 2:47 AM incident in under 5 minutes. A healthy match service should write one result per completed match. If that ratio drops below 1, something is silently failing.
1. Find your silent failures. Search your codebase for catch blocks that swallow an error and return success. Every one of those is a 2:47 AM incident waiting to happen.
2. Set an explicit timeout on every outbound call. No exceptions. 200ms for internal services, 2 seconds for external APIs. If you don't have timeouts, you don't have reliability.
3. Add a write-rate-to-success-rate ratio alert. For every critical write path, alert when writes per success drops below 0.95. This catches silent failures before your users report them.
52 minutes per year sounds achievable. And it is — but only if you treat reliability as a first-class engineering concern, not an operations problem. Every silent failure, every missing timeout, every missing rollback adds minutes to that budget.
The teams that hit 99.99% aren't running better hardware. They're writing better error handling.