Architecture 3 min read

Why We Chose Boring Technology for a 99.99% SLA

Table of contents

    A 99.99% availability commitment gives you 52 minutes and 36 seconds of downtime per year. Not per quarter - per year. When a client asks us to design for that number, the conversation about technology choices gets short very quickly.

    Our production stack is deliberately unexciting: AlmaLinux, nginx, PostgreSQL, and services written in languages our whole team can debug. This is not a lack of curiosity. It is a budget decision.

    Downtime budgets are spent on surprises

    Look at your last ten incidents. How many were caused by well-understood technology failing in a well-understood way, and how many by a component doing something nobody on the team could explain within the first hour?

    Every technology carries an operational surprise budget. Mature, widely deployed software has had its surprises found, documented, and turned into runbooks by thousands of teams before you. Newer or more complex systems make you the one writing the first runbook - during the incident.

    At four nines, one confusing incident can consume the entire year's budget. The math forces a preference for components whose failure modes are boring.

    What boring buys us, concretely

    Debuggability at 3 a.m. Every engineer on our on-call rotation can read a PostgreSQL query plan, an nginx error log, and a systemd journal. When the pager fires, there is no "wait for the one person who understands the mesh" step. Our median time to diagnosis is under 15 minutes, and this is the main reason.

    Upgrades as routine, not projects. Point releases of PostgreSQL and nginx are drama-free and well-announced. We patch on a monthly cadence without a war room. Teams we consult for who run ten-component stacks often skip patching because the interaction risk is unknowable - and unpatched software is its own availability risk.

    Failure modes with prior art. When a replica falls behind or a connection pool saturates, the answer is a search away because ten thousand teams have hit it before us. Prior art is an availability feature.

    Hiring and handoff. Enterprise clients keep systems for a decade. A stack a new hire can learn in a week keeps its bus factor healthy for that decade.

    Boring is a budget, not a religion

    The discipline is not "never use new technology". It is: you get a small number of novelty tokens, so spend them where they buy differentiation. We spent one building KDB, our NoSQL database, because storage engines are our business. We refuse to spend one on the load balancer, the queue, or the deploy pipeline, because a fascinating deploy pipeline makes us no money and can absolutely take us down.

    A test we use with clients evaluating a shiny component:

    1. What does this replace, and what is wrong with the boring version, stated in numbers?
    2. Who on the team can debug it alone on a Saturday?
    3. What is the exit cost if the project is abandoned in three years?

    Most proposals do not survive question one. The ones that do are usually worth it.

    The uncomfortable part

    Boring technology has a real cost: engineers want to grow, and "we run Postgres and it works" is a harder conference talk than a migration saga. We address that honestly - novelty budget goes into the product domain, where hard problems are plentiful, rather than into the infrastructure.

    Four nines is not achieved by heroics. It is achieved by removing the opportunities for surprise, one boring choice at a time.

    Copied