[Timeline: migration / online-alter · 47 min · orders.created_at_v2 (live): 12:04:18 ALTER TABLE orders ADD COLUMN; 12:04:19 shadow copy started; 12:18:42 row-by-row backfill, 18.4M rows; 12:42:01 lock spike, p99 +280%; 12:48:03 swap to new column; 12:51:22 drop old column]

DevOps

Schema migration monitoring: shipping without taking the database down

Running ALTER TABLE on a one-billion-row table is the most-feared 30-line PR. This post lays out what to monitor in Postgres / MySQL / MongoDB before, during, and after a migration.

Published · 10 min read

ALTER TABLE on a billion-row table is the most-feared 30-line PR in any backend repo. The migration worked in staging because staging is 100k rows. Production has 1B rows and an index that no longer fits in cache, and the next thing you know you've held an exclusive lock for 47 minutes. The good news: the monitoring you need to catch this is cheap and lives in places you already have.

On this page
  1. Pre-flight checks
  2. Live during the migration
  3. Post-deploy bake-off
  4. DB-specific gotchas
  5. FAQ

Pre-flight checks (catch 90% of bad migrations)

  • Estimate row count of affected tables; if > 1M and the migration rewrites the table, plan for downtime or use online tooling (a cheap estimate query is sketched after this list).
  • Identify the lock level. The Postgres docs list the lock each ALTER form takes; MySQL's online DDL reference has an equivalent table. If it's ACCESS EXCLUSIVE (Postgres) or a full table lock (MySQL), schedule a maintenance window.
  • Check disk free. Many ALTERs duplicate the table; you need 2-3× the table size in free space.
  • Run the migration against a recent prod snapshot, not staging.
  • For MySQL: prefer ALGORITHM=INSTANT when supported, then INPLACE, then COPY or pt-online-schema-change.
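
A minimal Postgres sketch for the row-count and size checks (orders is the example table from the timeline above; swap in your own schema and table):

-- Postgres pre-flight: estimated row count and on-disk size of the target table.
-- reltuples is the planner's estimate (refreshed by ANALYZE / autovacuum) and
-- is free, unlike COUNT(*) on a billion rows.
SELECT c.relname,
       c.reltuples::bigint                           AS est_rows,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relname = 'orders';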

Live monitoring during the migration

-- Postgres: are we waiting on a lock?
SELECT pid, query, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE datname = current_database()
  AND state != 'idle'
  AND query NOT ILIKE '%pg_stat_activity%';
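
If that shows sessions with wait_event_type = 'Lock', the follow-up question is who's holding the lock. A sketch using pg_blocking_pids (Pg 9.6+):

-- Postgres: pair each waiting session with the session(s) blocking it
SELECT w.pid   AS waiting_pid,
       w.query AS waiting_query,
       b.pid   AS blocking_pid,
       b.query AS blocking_query
FROM pg_stat_activity w
JOIN LATERAL unnest(pg_blocking_pids(w.pid)) AS blocker(pid) ON true
JOIN pg_stat_activity b ON b.pid = blocker.pid;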

-- Postgres: live progress (Pg 12+)
SELECT * FROM pg_stat_progress_create_index;
SELECT * FROM pg_stat_progress_cluster;

-- MySQL: ALTER stage progress (5.7+; the stage/innodb/alter% instruments
-- and stage consumers must be enabled; see below)
SELECT EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED
FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE 'stage/innodb/alter%';
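
If the stage query comes back empty, the instruments are probably disabled. They can be switched on at runtime via the performance_schema setup tables, no restart needed:

-- MySQL: enable the ALTER TABLE stage instruments and the stage consumers
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'stage/innodb/alter%';

UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE '%stages%';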

Post-deploy bake-off

Migration finished. Now compare query telemetry against pre-migration:

  • Top 100 normalized queries: same signatures, same plan structures? A plan-flip detector catches these regressions.
  • p99 per signature: within ±20% of yesterday? If not, the new index is colder than the old one, or it isn't being used at all (see the check after this list).
  • Lock-wait time: should be flat. A spike means a session is still holding something open from the migration.
  • Connection pool depth: should be stable. App-side connection storms are a common side effect of post-migration restarts.
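
For the "not being used" case, pg_stat_user_indexes answers directly. The index name below is hypothetical; substitute whatever your migration created:

-- Postgres: has the new index been scanned at all since the migration?
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE indexrelname = 'orders_created_at_v2_idx';  -- hypothetical index name

If idx_scan stays at zero while the old query patterns are still running, the planner isn't picking the index; check the plans before blaming cache warm-up.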

DB-specific gotchas

  • Postgres CREATE INDEX CONCURRENTLY takes roughly 2× as long but doesn't block writes. Always prefer it on a live table.
  • Postgres ALTER TABLE … ADD COLUMN with a default: in Pg 11+ this is metadata-only (effectively instant) for non-volatile defaults. Pre-11, it rewrites the whole table.
  • MySQL DROP INDEX: metadata-only, and therefore effectively instant, for secondary indexes. Primary-key changes still rebuild the table; treat them as full migrations.
  • MongoDB collMod: mostly metadata-only and quick. Adding validation rules doesn't rewrite existing documents; long-running operations like index builds are what show up in currentOp.
  • ClickHouse ALTER … ADD COLUMN: instant on metadata. Backfilling values is a mutation that runs asynchronously; check system.mutations (sketched below).
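
A minimal check for the ClickHouse case:

-- ClickHouse: mutations still running (or stuck) after an ALTER backfill
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;

A non-empty latest_fail_reason usually means the mutation is failing and retrying, not just slow.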

FAQ

Should I always use online schema change tools?
For Postgres, native CONCURRENTLY is enough for indexes; pg_repack for table rewrites. For MySQL, pt-online-schema-change or gh-ost when native online DDL won't work.
How do I cancel a running migration safely?
Postgres: cancel via pg_cancel_backend, then pg_terminate_backend if the backend doesn't respond. MySQL: KILL the connection. Both leave state consistent, but a cancelled CREATE INDEX CONCURRENTLY leaves an INVALID index behind that you need to drop.
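A sketch of the Postgres path; the pid is whatever pg_stat_activity shows for the migration session (12345 is a placeholder):

-- Cancel politely first (SIGINT); escalate only if the backend ignores it
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);  -- last resort

-- Then find half-built indexes left behind by a cancelled
-- CREATE INDEX CONCURRENTLY, and DROP INDEX CONCURRENTLY each one
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;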
Can I roll back?
Plan rollback as a separate migration. Don't rely on transactional DDL: MySQL ALTERs auto-commit, and even Postgres can't run CONCURRENTLY operations inside a transaction. The rollback PR should be ready before the forward migration runs.
