Zero-Downtime Database Migrations at Scale: Patterns That Survived Production

Every team says their migrations are zero-downtime until a ten-second lock on a hot table takes checkout offline. We have run schema changes on PostgreSQL clusters serving thousands of transactions per second, for our own products and for clients, and the patterns below are the ones that survived. Each rule exists because we, or a client, learned it the hard way.

The core pattern: expand, migrate, contract

Every non-trivial schema change becomes three deployments, not one:

Expand - add the new column, table, or index alongside the old one. The application ignores it.
Migrate - dual-write to old and new, backfill historical rows in batches, then switch reads to the new structure behind a flag.
Contract - once nothing reads the old structure, drop it in a later release.

It feels slow. It is also the only approach we have never had to roll back at 3 a.m. The single-deployment "rename the column and update the code atomically" plan assumes your deploy is atomic across every app server. It is not.

Rules we enforce in review

Always set lock_timeout

Almost every DDL statement in PostgreSQL needs a lock that queues behind long-running queries - and everything else then queues behind your DDL. A blocked ALTER TABLE can take a service down without doing any work at all.

SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN fulfillment_channel text;

If the migration cannot get its lock in two seconds, it fails and we retry off-peak. Failing fast is a feature.

Indexes are always CONCURRENTLY

CREATE INDEX CONCURRENTLY cannot run in a transaction and can leave an invalid index behind on failure, so our migration runner checks for and drops invalid indexes before retrying. That is still a far better failure mode than the write lock a plain CREATE INDEX takes on a hot table.

Backfills are batched, resumable, and boring

Backfilling a billion-row table is an operational task, not a SQL statement. Our backfill jobs process fixed-size batches with a progress cursor stored in the database, throttle on replication lag, and can be stopped and resumed at any time.

UPDATE orders
SET fulfillment_channel = 'web'
WHERE id BETWEEN :cursor AND :cursor + 10000
  AND fulfillment_channel IS NULL;

The throttle matters more than the batch size. The first version of our backfill runner pushed replica lag past 30 seconds and turned a routine change into a read-consistency incident.

NOT NULL comes last, and in steps

Adding a NOT NULL constraint validates every row under an exclusive lock. We add the constraint as NOT VALID, then VALIDATE CONSTRAINT separately - validation takes only a light lock and can run for hours without blocking writes.

The part that is not about SQL

The riskiest migrations we have seen in consulting engagements failed for organizational reasons, not technical ones: the schema change and the code change were owned by different teams, deployed on different schedules, with no shared definition of "done". The contract step never happened, dual-write code stayed in place for a year, and the next migration had to reason about both.

Our fix is procedural. Every migration gets a tracking issue with all three phases listed at creation time, and the contract step has an owner and a target date before the expand step ships. Unfinished migrations are debt with a face attached.

0migration rollbacks since adopting the playbook

2sstandard lock_timeout for DDL

3deployments per schema change

What disciplined migrations look like in practice.

Slow is smooth, and smooth is fast. Three careful deployments beat one exciting one every time.