[Timeline: migration / online-alter · 47 min · orders.created_at_v2 (live): 12:04:18 ALTER TABLE orders ADD COLUMN; 12:04:19 shadow copy started; 12:18:42 row-by-row backfill, 18.4M rows; 12:42:01 lock spike, p99 +280%; 12:48:03 swap to new column; 12:51:22 drop old column]

DevOps

Schema migration monitoring: shipping without taking the database down

Running ALTER TABLE on a one-billion-row table is the most-feared 30-line PR. This post lays out what to monitor in Postgres / MySQL / MongoDB before, during, and after a migration.

Published · 10 min read

ALTER TABLE on a billion-row table is the most-feared 30-line PR in any backend repo. The migration worked in staging because staging is 100k rows. Production has 1B rows and an index that no longer fits in cache, and the next thing you know you've held an exclusive lock for 47 minutes. The good news: the monitoring you need to catch this is cheap and lives in places you already have.

On this page
  1. Pre-flight checks
  2. Live during the migration
  3. Post-deploy bake-off
  4. DB-specific gotchas
  5. FAQ

Pre-flight checks (catch 90% of bad migrations)

  • Estimate row count of affected tables; if > 1M and the migration rewrites the table, plan for downtime or use online tooling (a cheap estimate query is sketched after this list).
  • Identify the lock level. The Postgres docs list the lock each ALTER form takes; MySQL's online DDL reference has an equivalent table. If it's ACCESS EXCLUSIVE (Postgres) or a full table lock (MySQL), schedule a maintenance window.
  • Check disk free. Many ALTERs duplicate the table; you need 2-3× the table size in free space.
  • Run the migration against a recent prod snapshot, not staging.
  • For MySQL: prefer ALGORITHM=INSTANT when supported, then INPLACE, then COPY or pt-online-schema-change.
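
A minimal Postgres sketch for the row-count and size checks (orders is the example table from the timeline above; swap in your own schema and table):

-- Postgres pre-flight: estimated row count and on-disk size of the target table.
-- reltuples is the planner's estimate (refreshed by ANALYZE / autovacuum) and
-- is free, unlike COUNT(*) on a billion rows.
SELECT c.relname,
       c.reltuples::bigint                           AS est_rows,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relname = 'orders';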

Live monitoring during the migration

-- Postgres: are we waiting on a lock?
SELECT pid, query, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE datname = current_database()
  AND state != 'idle'
  AND query NOT ILIKE '%pg_stat_activity%';
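
If that shows sessions with wait_event_type = 'Lock', the follow-up question is who's holding the lock. A sketch using pg_blocking_pids (Pg 9.6+):

-- Postgres: pair each waiting session with the session(s) blocking it
SELECT w.pid   AS waiting_pid,
       w.query AS waiting_query,
       b.pid   AS blocking_pid,
       b.query AS blocking_query
FROM pg_stat_activity w
JOIN LATERAL unnest(pg_blocking_pids(w.pid)) AS blocker(pid) ON true
JOIN pg_stat_activity b ON b.pid = blocker.pid;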

-- Postgres: live progress (Pg 12+)
SELECT * FROM pg_stat_progress_create_index;
SELECT * FROM pg_stat_progress_cluster;

-- MySQL: ALTER stage progress (5.7+; the stage/innodb/alter% instruments
-- and stage consumers must be enabled; see below)
SELECT EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED
FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE 'stage/innodb/alter%';
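
If the stage query comes back empty, the instruments are probably disabled. They can be switched on at runtime via the performance_schema setup tables, no restart needed:

-- MySQL: enable the ALTER TABLE stage instruments and the stage consumers
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'stage/innodb/alter%';

UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE '%stages%';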

Post-deploy bake-off

Migration finished. Now compare query telemetry against pre-migration:

  • Top 100 normalized queries: same signatures, same plan structures? A plan-flip detector catches these regressions.
  • p99 per signature: within ±20% of yesterday? If not, the new index is colder than the old one, or it isn't being used at all (see the check after this list).
  • Lock-wait time: should be flat. A spike means a session is still holding something open from the migration.
  • Connection pool depth: should be stable. App-side connection storms are a common side effect of post-migration restarts.
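
For the "not being used" case, pg_stat_user_indexes answers directly. The index name below is hypothetical; substitute whatever your migration created:

-- Postgres: has the new index been scanned at all since the migration?
SELECT relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE indexrelname = 'orders_created_at_v2_idx';  -- hypothetical index name

If idx_scan stays at zero while the old query patterns are still running, the planner isn't picking the index; check the plans before blaming cache warm-up.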

DB-specific gotchas

  • Postgres CREATE INDEX CONCURRENTLY takes roughly 2× as long but doesn't block writes. Always prefer it on a live table.
  • Postgres ALTER TABLE … ADD COLUMN with a default: in Pg 11+ this is metadata-only (effectively instant) for non-volatile defaults. Pre-11, it rewrites the whole table.
  • MySQL DROP INDEX: metadata-only, and therefore effectively instant, for secondary indexes. Primary-key changes still rebuild the table; treat them as full migrations.
  • MongoDB collMod: mostly metadata-only and quick. Adding validation rules doesn't rewrite existing documents; long-running operations like index builds are what show up in currentOp.
  • ClickHouse ALTER … ADD COLUMN: instant on metadata. Backfilling values is a mutation that runs asynchronously; check system.mutations (sketched below).
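
A minimal check for the ClickHouse case:

-- ClickHouse: mutations still running (or stuck) after an ALTER backfill
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;

A non-empty latest_fail_reason usually means the mutation is failing and retrying, not just slow.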

FAQ

Should I always use online schema change tools?
For Postgres, native CONCURRENTLY is enough for indexes; pg_repack for table rewrites. For MySQL, pt-online-schema-change or gh-ost when native online DDL won't work.
How do I cancel a running migration safely?
Postgres: cancel via pg_cancel_backend, then pg_terminate_backend if the backend doesn't respond. MySQL: KILL the connection. Both leave state consistent, but a cancelled CREATE INDEX CONCURRENTLY leaves an INVALID index behind that you need to drop.
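A sketch of the Postgres path; the pid is whatever pg_stat_activity shows for the migration session (12345 is a placeholder):

-- Cancel politely first (SIGINT); escalate only if the backend ignores it
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);  -- last resort

-- Then find half-built indexes left behind by a cancelled
-- CREATE INDEX CONCURRENTLY, and DROP INDEX CONCURRENTLY each one
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;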
Can I roll back?
Plan rollback as a separate migration. Don't rely on transactional DDL: MySQL ALTERs auto-commit, and even Postgres can't run CONCURRENTLY operations inside a transaction. The rollback PR should be ready before the forward migration runs.
