Last year our on-call rotation had a problem that dashboards could not fix: engineers were spending more time figuring out what was wrong than fixing it. Mean time to detect (MTTD) looked healthy. Mean time to resolve (MTTR) did not.
We had alerts. We had runbooks. We had enough incident commanders. What we lacked was a coherent lifecycle: too many context switches, too many places to look, and runbooks that described symptoms without telling you what to do next.
The baseline
In Q1 2024 our median time to resolve customer-impacting incidents was 87 minutes. Pager load was concentrated on a small group of senior engineers, partly because junior engineers avoided pages they did not feel equipped to handle.
We mapped every incident from the previous two quarters and found three patterns:
- Duplicate alerts created noise and slowed triage.
- Runbooks lived in three places: Confluence, a GitHub wiki, and inline comments in Terraform.
- Communication overhead consumed 30-40% of the first hour.
What we changed
1. A single incident command channel, always
We standardized on a Slack workflow that creates a dedicated incident channel, pins the names of the on-call engineer and incident commander, and posts a timeline template. No more hunting for the right thread. The first message in every channel is the same.
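To make "the first message is the same" concrete, here is a minimal sketch of rendering that opening message. The function name, the field labels, and the layout are all assumptions for illustration; our actual Slack workflow template is not reproduced here, and no Slack API call is made.

```python
from datetime import datetime, timezone


def first_message(summary: str, on_call: str, commander: str, opened_at: datetime) -> str:
    """Render an identical opening message for every new incident channel.

    Hypothetical layout: the labels below are illustrative, not a real template.
    `opened_at` should be timezone-aware so the UTC conversion is unambiguous.
    """
    stamp = opened_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return "\n".join([
        f"Incident: {summary}",
        f"On-call engineer: {on_call}",
        f"Incident commander: {commander}",
        "",
        "Timeline",
        f"{stamp}  Incident channel opened",
    ])
```

Keeping the renderer a pure function means the template can be checked in a unit test without touching Slack.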
2. Executable runbooks
We replaced static documents with small CLI-driven workflows. Each runbook is a Markdown file with fenced command blocks. A small internal tool called runbook renders them, prompts for variables, and executes commands in a safe shell.
# Check replication lag across all primaries
runbook exec postgres/replication-lag \
  --env production \
  --threshold-seconds 5
The goal was not automation for its own sake. It was to make the next step obvious and reproducible.
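The core idea, pulling command blocks out of a Markdown runbook and filling in variables before anything runs, can be sketched in a few lines. This is not the internal runbook tool. The `{{name}}` placeholder syntax, the function names, and the accepted fence labels are assumptions made for the example, and the sketch only renders commands without executing them.

```python
import re
import shlex

FENCE = "`" * 3  # built this way so the example can itself sit inside a fenced block
BLOCK_RE = re.compile(FENCE + r"(?:sh|bash|shell)?\n(.*?)" + FENCE, re.DOTALL)
VAR_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")  # hypothetical {{name}} placeholder syntax


def extract_commands(markdown: str) -> list[str]:
    """Return the body of each fenced command block, in document order."""
    return [body.strip() for body in BLOCK_RE.findall(markdown)]


def render(command: str, variables: dict[str, str]) -> str:
    """Fill {{name}} placeholders, shell-quoting each value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing runbook variable: {name}")
        return shlex.quote(variables[name])

    return VAR_RE.sub(substitute, command)
```

Quoting every substituted value with `shlex.quote` is one small piece of what a "safe shell" might involve. A real tool would likely also want a confirmation step before running anything against production.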
3. Alert quality score
Every alert now requires a runbook link and a severity rationale. We review the top 20 noisiest alerts monthly. Alerts that have not fired recently are candidates for removal. Within six months we cut alert volume by 44%.
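A monthly review along these lines could be sketched as follows. The `Alert` fields, the `review` function, and the 90-day cutoff are assumptions for illustration, and the sketch treats "recent" as meaning the alert has fired within that window. Our real alert metadata and thresholds are not shown here.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    name: str
    runbook_url: str = ""          # hypothetical field names
    severity_rationale: str = ""


def review(alerts: list[Alert],
           firings: list[tuple[str, datetime]],
           now: datetime,
           top_n: int = 20,
           stale_after: timedelta = timedelta(days=90)) -> dict:
    """Summarize alert quality from definitions and (alert_name, fired_at) pairs."""
    counts = Counter(name for name, _ in firings)

    # Track the most recent firing per alert.
    last_fired: dict[str, datetime] = {}
    for name, fired_at in firings:
        if name not in last_fired or fired_at > last_fired[name]:
            last_fired[name] = fired_at

    return {
        # Most frequently firing alerts, noisiest first.
        "noisiest": counts.most_common(top_n),
        # Alerts missing a runbook link or a severity rationale.
        "incomplete": [a.name for a in alerts
                       if not (a.runbook_url and a.severity_rationale)],
        # Alerts that never fired, or last fired before the cutoff.
        "removal_candidates": [a.name for a in alerts
                               if a.name not in last_fired
                               or now - last_fired[a.name] > stale_after],
    }
```

The three lists map onto the three rules above: what to look at first, what fails the entry requirements, and what may no longer be earning its place.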
What still needs work
Runbooks age. We are experimenting with weekly "stale runbook" reminders tied to recent incident outcomes. We also want to improve handoff between regions when a page fires outside business hours.
The biggest lesson, though, is cultural. Reducing MTTR was not about heroics. It was about removing friction so that more people could respond confidently.