How We Cut Incident MTTR by 60% Without Hiring More On-Call Engineers

    Last year our on-call rotation had a problem that dashboards could not fix: engineers were spending more time figuring out what was wrong than fixing it. Mean time to detect (MTTD) looked healthy. Mean time to resolve (MTTR) did not.

    We had alerts. We had runbooks. We had enough incident commanders. What we lacked was a coherent incident lifecycle. Responders faced too many context switches, too many places to look, and runbooks that described symptoms without telling them what to do next.

    The baseline

    In Q1 2024 our median MTTR for customer-impacting incidents was 87 minutes. Pager load was concentrated on a small group of senior engineers, partly because junior engineers avoided pages they did not feel equipped to handle.

    We mapped every incident from the previous two quarters and found three patterns:

    1. Duplicate alerts created noise and slowed triage.
    2. Runbooks lived in three places: Confluence, a GitHub wiki, and inline comments in Terraform.
    3. Communication overhead consumed 30-40% of the first hour.

    What we changed

    1. A single incident command channel, always

    We standardized on a Slack workflow that creates a dedicated incident channel, pins the on-call engineer and incident commander, and posts a timeline template. No more hunting for the right thread. The first message in every channel is the same.
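
    As a rough illustration of "the first message in every channel is the same", the sketch below builds a predictable channel name and an opening message from one incident record. The Incident fields, the inc-YYYYMMDD-slug naming scheme, and the timeline headings are assumptions made for this example, not our actual Slack workflow, and the Slack calls themselves are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Incident:
    slug: str          # short identifier chosen by the responder (hypothetical field)
    on_call: str
    commander: str
    opened_at: datetime


def channel_name(incident: Incident) -> str:
    """Build a predictable channel name of the form inc-YYYYMMDD-slug."""
    return f"inc-{incident.opened_at:%Y%m%d}-{incident.slug}"


def first_message(incident: Incident) -> str:
    """Render the opening message: who is responding, plus an empty timeline."""
    lines = [
        f"On-call: {incident.on_call}",
        f"Incident commander: {incident.commander}",
        "Timeline (UTC):",
        f"- {incident.opened_at:%H:%M} Incident opened",
        "- HH:MM Impact confirmed:",
        "- HH:MM Mitigation started:",
        "- HH:MM Resolved:",
    ]
    return "\n".join(lines)
```

    Because both functions depend only on the incident record, every channel starts with the same structure no matter who opens it.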

    2. Executable runbooks

    We replaced static documents with small CLI-driven workflows. Each runbook is a Markdown file with fenced command blocks. A small internal tool called runbook renders them, prompts for variables, and executes commands in a safe shell.

    # Check replication lag across all primaries
    runbook exec postgres/replication-lag \
      --env production \
      --threshold-seconds 5

    The goal was not automation for its own sake. It was to make the next step obvious and reproducible.
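
    To make the "Markdown file with fenced command blocks" idea concrete, here is a minimal sketch of the two steps that come before execution: pulling the command blocks out of a runbook and filling in variables. It is not the code of our runbook tool. The $name placeholder syntax and the check-lag command are assumptions for illustration, and the execution step is omitted.

```python
import re
import shlex
import string

# Built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# An opening fence (optionally followed by a language tag), the body,
# then a closing fence on its own line.
BLOCK_RE = re.compile(
    rf"^{FENCE}[ \t]*\w*[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def command_blocks(markdown: str) -> list[str]:
    """Return the body of every fenced block, in document order."""
    return [m.group(1).strip() for m in BLOCK_RE.finditer(markdown)]


def render(command: str, variables: dict[str, str]) -> str:
    """Fill $name placeholders with shell-quoted values.

    Raises KeyError if the command uses a variable that was not supplied.
    """
    quoted = {name: shlex.quote(value) for name, value in variables.items()}
    return string.Template(command).substitute(quoted)


# A tiny runbook; check-lag is a hypothetical command.
RUNBOOK = (
    "# Replication lag\n"
    "\n"
    "Check lag on every primary.\n"
    "\n"
    f"{FENCE}sh\n"
    "check-lag --env $env --threshold-seconds $threshold\n"
    f"{FENCE}\n"
)

for block in command_blocks(RUNBOOK):
    print(render(block, {"env": "production", "threshold": "5"}))
```

    Quoting each value before substitution keeps a stray space or semicolon in a prompt answer from changing the command that eventually runs.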

    3. Alert quality score

    Every alert now requires a runbook link and a severity rationale. We review the top 20 noisiest alerts monthly. Alerts that have not fired recently are candidates for removal. Within six months we cut alert volume by 44%.
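
    The three checks in this review can be expressed as small functions over an alert inventory. The sketch below assumes a hypothetical Alert record. The field names and the 90-day idle cutoff are illustrative choices, not values from our tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Alert:
    name: str
    runbook_url: str
    severity_rationale: str
    fires_last_30d: int
    last_fired: Optional[date]  # None if the alert has never fired


def missing_metadata(alerts: list[Alert]) -> list[str]:
    """Names of alerts lacking a runbook link or a severity rationale."""
    return [
        a.name
        for a in alerts
        if not a.runbook_url.strip() or not a.severity_rationale.strip()
    ]


def noisiest(alerts: list[Alert], limit: int = 20) -> list[Alert]:
    """The alerts that fired most often in the last 30 days, noisiest first."""
    return sorted(alerts, key=lambda a: a.fires_last_30d, reverse=True)[:limit]


def removal_candidates(
    alerts: list[Alert], today: date, max_idle_days: int = 90
) -> list[str]:
    """Names of alerts that never fired or last fired before the cutoff."""
    cutoff = today - timedelta(days=max_idle_days)
    return [a.name for a in alerts if a.last_fired is None or a.last_fired < cutoff]
```

    Keeping the checks separate means the metadata rule can gate new alerts when they are created, while the other two feed the monthly review.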

    33 min median MTTR
    44% alert volume reduction
    2.1x more engineers taking pages

    Incident response improvements, Q1 2024 vs Q1 2025.
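
    For readers who want to check the headline against these figures, the calculation is a plain percentage reduction from the 87-minute baseline to the 33-minute median. The helper name below is ours for this example.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Median MTTR in minutes: Q1 2024 baseline vs Q1 2025.
mttr_reduction = percent_reduction(87, 33)
print(f"{mttr_reduction:.1f}%")  # prints "62.1%"
```

    The exact figure is about 62%, which the title rounds to 60%.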

    What still needs work

    Runbooks age. We are experimenting with weekly "stale runbook" reminders tied to recent incident outcomes. We also want to improve handoff between regions when a page fires outside business hours.
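
    One possible reading of "tied to recent incident outcomes" is to flag any runbook that was used in an incident during the past week and has not been edited since. The sketch below follows that reading. The RunbookRecord fields and the seven-day window are assumptions, and the experiment itself may settle on different rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RunbookRecord:
    path: str
    last_edited: date
    last_used_in_incident: Optional[date]  # None if never used in an incident


def needs_review(
    records: list[RunbookRecord], today: date, recent_days: int = 7
) -> list[str]:
    """Paths of runbooks used in a recent incident but not edited since."""
    window_start = today - timedelta(days=recent_days)
    return [
        r.path
        for r in records
        if r.last_used_in_incident is not None
        and r.last_used_in_incident >= window_start
        and r.last_edited < r.last_used_in_incident
    ]
```

    Tying the reminder to actual use keeps the weekly list short: a runbook nobody reached for does not show up until it matters.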

    The biggest lesson, though, is cultural. Reducing MTTR was not about heroics. It was about removing friction so that more people could respond confidently.