Obsfly

MySQL

MySQL replica lag: 9 causes, ranked by frequency

Seconds_Behind_Master is a lying integer. Here is the real diagnostic order (single-threaded apply, long transactions, schema migrations, network), with the SQL to confirm each one.

Published · 10 min read

Seconds_Behind_Master is a lying integer. It’s a delta between two clocks that could already be skewed, computed only when the IO thread is healthy. When you’re lagging, you need to know why, not how much. Here are the nine common causes ranked by how often they’re the culprit on a real production cluster, with the SQL to confirm each.

On this page
  1. First two minutes
  2. Nine causes, ranked
  3. Parallel replication changes the picture
  4. What to monitor
  5. FAQ

First two minutes

SHOW REPLICA STATUS\G  -- (or SHOW SLAVE STATUS on 5.7)

-- Look at:
-- Replica_IO_Running, Replica_SQL_Running    (must be Yes)
-- Last_IO_Errno, Last_SQL_Errno              (must be 0)
-- Seconds_Behind_Source                      (the lying integer — but useful as a trend)
-- Replica_SQL_Running_State                  (the SQL thread's current activity)

If Replica_SQL_Running_State is “Reading event from the relay log” but lag keeps growing, you’re bottlenecked on apply. If the worker threads sit in “Waiting for an event from Coordinator” for long stretches while lag grows, the workers are starved: the coordinator can’t dispatch transactions in parallel (dependency or commit-order constraints), so adding more workers won’t help.
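To see the same picture per worker rather than from the coordinator’s single state line, the Performance Schema applier tables (MySQL 8.0) are more direct:

```sql
-- Where each applier worker actually is: the last transaction it
-- applied, the one it is applying now, and any per-worker error.
SELECT WORKER_ID,
       SERVICE_STATE,
       LAST_APPLIED_TRANSACTION,
       APPLYING_TRANSACTION,
       LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;
```

A worker with an empty APPLYING_TRANSACTION during heavy primary write load is idle: the bottleneck is upstream of it.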

Nine causes, ranked by frequency

  1. Single-threaded apply on a write-heavy primary. Confirm: check replica_parallel_workers; if it is 1, that’s your fix.
  2. Long transaction holding a lock. Confirm: SHOW PROCESSLIST on the replica; look for the apply thread stuck waiting.
  3. DDL replication. Confirm: SELECT * FROM performance_schema.events_statements_history on the replica; look for long-running ALTERs.
  4. Network bandwidth saturation. Confirm: compare relay-log growth rate on the replica vs. binlog generation rate on the primary.
  5. Disk IO saturation on the replica. Confirm: iostat, or Performance Schema file_summary_by_instance.
  6. binlog_format causing extra work. Confirm: check binlog_format; STATEMENT re-executes expensive queries on the replica, while ROW with binlog_row_image=FULL produces bulky events for wide tables.
  7. One huge transaction (a single statement touching a whole table). Confirm: diff Retrieved_Gtid_Set vs. Executed_Gtid_Set and check relay-log event sizes.
  8. Replica serving heavy reads (BI queries). Confirm: replica QPS higher than expected from app traffic.
  9. Replica on smaller hardware than the primary. Confirm: compare instance classes.
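For cause #2, information_schema.innodb_trx lists every open InnoDB transaction and how long it has been running. On the replica it exposes an apply thread blocked behind a lock; on the primary it flags the long writer that produced the oversized event in the first place:

```sql
-- Open transactions, oldest first; a trx_started far in the past on
-- the replica usually means the applier is blocked behind a lock.
SELECT trx_id,
       trx_state,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS open_for_s,
       trx_rows_modified,
       trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 10;
```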

Parallel replication changes the picture

From 8.0 (the variables were named slave_parallel_* before 8.0.26), with replica_parallel_type=LOGICAL_CLOCK and replica_parallel_workers > 1, multiple workers apply non-conflicting transactions in parallel. Setup, with the SQL thread stopped as these variables require:

STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA SQL_THREAD;
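How much parallelism the replica can extract is bounded by how the source tags transaction dependencies. A sketch, assuming an 8.0 source you control (and the 8.0 default of transaction_write_set_extraction=XXHASH64): WRITESET tracking marks transactions as independent whenever their write sets don’t overlap, not only when they happened to group-commit together.

```sql
-- On the SOURCE: tag transactions as parallelizable by write-set
-- overlap instead of by group-commit timing.
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';
```

This typically raises apply parallelism the most on workloads with many small, disjoint writes.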

What to monitor (and what to alert on)

  • Lag in binlog bytes (not seconds) — accurate even when the SQL thread is stuck.
  • Apply throughput in events/s; sudden drop = SQL thread blocked.
  • Worker imbalance from replication_applier_status_by_worker — alert when 95th-percentile worker lag exceeds median by 5×.
  • GTID gaps (Retrieved_Gtid_Set vs Executed_Gtid_Set) for unexpected pauses.
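The per-worker lag figure above can be pulled straight from the Performance Schema; a sketch using the 8.0 applier timestamp columns:

```sql
-- Per-worker apply delay: original commit time on the source vs. the
-- moment this worker finished applying the same transaction.
SELECT WORKER_ID,
       TIMESTAMPDIFF(SECOND,
         LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
         LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP) AS apply_delay_s
FROM performance_schema.replication_applier_status_by_worker
ORDER BY apply_delay_s DESC;
```

Feed the max and the median of apply_delay_s into your alerting to catch the worker-imbalance case.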

FAQ

Is Seconds_Behind_Source ever accurate?
When IO and SQL threads are healthy and clocks aren't skewed, yes. Use it for trend, not for debugging.

Should I disable replica_preserve_commit_order for max throughput?
Only on read-only replicas where reads don't depend on causal consistency. Otherwise keep it on.

GTID vs binlog position — which to track?
GTID is canonical in modern setups. Binlog position is still useful when comparing relay-log positions on the replica side.
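The GTID-gap check from the monitoring list is one function call: GTID_SUBTRACT of what the replica has received against what it has executed yields exactly the transactions still queued in the relay log.

```sql
-- Received but not yet applied: the backlog the SQL thread still owes.
SELECT GTID_SUBTRACT(RECEIVED_TRANSACTION_SET,
                     @@GLOBAL.gtid_executed) AS not_yet_applied
FROM performance_schema.replication_connection_status;
```

An empty result with lag still reported points the blame at clock skew or a stalled IO thread rather than apply.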


Watch your databases the way you watch your services.

Book a 30-minute demo. We spec out your fleet together and quote your first 30-day deal.
