Obsfly
mysql / replication · seconds_behind_masterlivealert at 30s · spike at 18:42 · catch-up via parallel workersalert · 30speak · 47s

MySQL

MySQL 副本延迟:按发生频率排列的 9 个原因

Seconds_Behind_Master 是个会撒谎的整数。这里给出真正的诊断顺序 — 单线程 apply、长事务、schema 迁移、网络 — 每种原因都附 SQL。

Published ·10 min read

Seconds_Behind_Master is a lying integer. It’s a delta between two clocks that could already be skewed, computed only when the IO thread is healthy. When you’re lagging, you need to know why, not how much. Here are the nine common causes ranked by how often they’re the culprit on a real production cluster, with the SQL to confirm each.

On this page
  1. First two minutes
  2. Nine causes, ranked
  3. Parallel replication changes the picture
  4. What to monitor
  5. FAQ

First two minutes

SHOW REPLICA STATUS\G  -- (or SHOW SLAVE STATUS on 5.7)

-- Look at:
-- Replica_IO_Running, Replica_SQL_Running    (must be Yes)
-- Last_IO_Errno, Last_SQL_Errno              (must be 0)
-- Seconds_Behind_Source                      (the lying integer — but useful as a trend)
-- Replica_SQL_Running_State                  (the SQL thread's current activity)

If Replica_SQL_Running_State is “Reading event from the relay log” but progressing slowly, you’re bottlenecked on apply. If it’s “Waiting for an event from Coordinator” for long stretches, you’re bottlenecked on parallel apply slot allocation.

Nine causes, ranked by frequency

#CauseHow to confirm
1Single-threaded apply on a write-heavy primaryCheck replica_parallel_workers; if 1, that’s your fix
2Long transaction holding a lockReplica side: SHOW PROCESSLIST, look for the apply thread waiting
3DDL replicationSELECT * FROM performance_schema.events_statements_history on replica
4Network bandwidth saturationCompare relay-log growth rate vs. binlog generation rate
5Disk IO saturation on replicaiostat / Performance Schema file_summary_by_instance
6Different binlog_format causing extra workRBS replicating row events while replica needs statement-level
7Large GTID transaction (one statement = whole table)Check Executed_Gtid_Set diff
8Replica running heavy reads (BI queries)Replica QPS > expected from app reads
9Hardware downgrade vs primaryReplica instance class smaller than primary

Parallel replication changes the picture

From 8.0 with replica_parallel_type=LOGICAL_CLOCK and replica_parallel_workers > 1, multiple workers apply non-conflicting transactions in parallel. Setup:

SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_preserve_commit_order = ON;

What to monitor (and what to alert on)

  • Lag in binlog bytes (not seconds) — accurate even when the SQL thread is stuck.
  • Apply throughput in events/s; sudden drop = SQL thread blocked.
  • Worker imbalance from replication_applier_status_by_worker — alert when 95th-percentile worker lag exceeds median by 5×.
  • GTID gaps (Retrieved_Gtid_Set vs Executed_Gtid_Set) for unexpected pauses.

FAQ

Is Seconds_Behind_Source ever accurate?+
When IO and SQL threads are healthy and clocks aren't skewed, yes. Use it for trend, not for debugging.
Should I disable replica_preserve_commit_order for max throughput?+
Only on read-only replicas where reads don't depend on causal consistency. Otherwise keep it on.
GTID vs binlog position — which to track?+
GTID is canonical in modern setups. Binlog position is still useful when comparing relay-log positions on the replica side.

Keep reading

· · ·

像监控服务一样监控你的数据库。

预约 30 分钟演示。我们一起规划你的数据库规模,并报出第一个 30 天合作的报价。

MySQL 副本延迟:按发生频率排列的 9 个原因 · Obsfly