Obsfly
[Cluster diagram: mongos router in front of 4 shards, 525 chunks total — shard-01 (rs0:27017) 142 chunks, shard-02 (rs1:27017) 138, shard-03 (rs2:27017) 156, shard-04 (rs3:27017) 89]


Sharded MongoDB Monitoring: The Key Metrics That Predict Imbalance

Chunk distribution, jumbo chunks, balancer round time, hot shards: these few metrics separate a healthy cluster from one that's about to need rebalancing.

Published · 12 min read

Sharded MongoDB is a different beast from a replica set. The dashboards you have show per-shard health; the failures usually come from between the shards. Imbalance, jumbo chunks, balancer livelock, and shard-key skew don’t show up on a per-shard graph. Here’s the cluster-level monitoring set you actually need.

On this page
  1. Cluster-level metrics
  2. Balancer health
  3. Jumbo chunks
  4. Shard-key skew
  5. Alerts
  6. FAQ

Cluster-level metrics

// On any mongos:
db.adminCommand({ balancerStatus: 1 })
sh.status()
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])

That last one is the one that surprises teams: the chunk count by shard. Healthy clusters are within ±5%; problematic ones are 30%+ apart and the balancer has given up.
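The ±5% / 30% thresholds are easy to check once you export those per-shard counts. A minimal sketch in plain Node.js — the input shape mirrors the aggregation output above, and the thresholds are this article's rules of thumb, not a MongoDB constant:

```javascript
// Given the output of the chunk-count aggregation above
// (one { _id: shard, chunks: count } doc per shard),
// compute the spread between the most- and least-loaded shard.
function chunkSpread(counts) {
  const values = counts.map((c) => c.chunks);
  const max = Math.max(...values);
  const min = Math.min(...values);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  // Spread as a fraction of the mean: ~0.05 is healthy,
  // 0.30+ means the balancer is losing (or has given up).
  return (max - min) / mean;
}

// The example cluster from the diagram: 142/138/156/89 chunks.
const counts = [
  { _id: "shard-01", chunks: 142 },
  { _id: "shard-02", chunks: 138 },
  { _id: "shard-03", chunks: 156 },
  { _id: "shard-04", chunks: 89 },
];
console.log(chunkSpread(counts).toFixed(2)); // "0.51" — well past the 30% line
```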

Balancer health

  • Round time — balancer rounds typically take seconds. Rounds taking minutes mean a single migration is stuck.
  • Migrations failed (last hour) — should be 0; sustained failures = chunk that won’t move (jumbo, write conflict, lock contention).
  • Active migrations — only 1 per source shard at a time; if 0 for 24h on an imbalanced cluster, balancer is paused or stuck.
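A watcher for these signals can be sketched as a pure function over two consecutive balancerStatus samples. `mode` and `numBalancerRounds` are real fields of the balancerStatus reply; the `imbalanced` flag is an assumption here, fed by your own chunk-spread check:

```javascript
// Classify balancer health from two consecutive samples of
// db.adminCommand({ balancerStatus: 1 }), taken e.g. 10 minutes apart.
// `imbalanced` is your own signal (e.g. chunk-count spread > 30%).
function balancerHealth(prev, curr, imbalanced) {
  if (curr.mode !== "full") return "paused"; // disabled via sh.stopBalancer()
  const progressed = curr.numBalancerRounds > prev.numBalancerRounds;
  if (!progressed && imbalanced) return "stuck"; // imbalanced, no rounds completing
  return "ok";
}

// Imbalanced cluster, and no round has completed since the last sample:
console.log(balancerHealth(
  { numBalancerRounds: 40 },
  { mode: "full", numBalancerRounds: 40 },
  true
)); // "stuck"
```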

Jumbo chunks

A chunk that exceeds chunkSize (default 128 MB since MongoDB 6.0; 64 MB before) and can’t be split is marked jumbo. The balancer refuses to migrate it. Find them:

db.getSiblingDB("config").chunks.find({ jumbo: true })
  .sort({ "min": 1 })
  .toArray()

The fix depends on the cause: a bad (low-cardinality) shard key — the most common, and not fixable without resharding — or a single shard-key value whose documents together exceed chunkSize (rare; the chunk can never split because every document shares that key value). On 4.4+, refineCollectionShardKey can help by adding suffix fields to make ranges splittable.
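refineCollectionShardKey only accepts a new key that keeps the existing key's fields, in order and with the same values, as a prefix — you can only append suffix fields. A pre-flight check for that constraint (the customerId/orderId field names are made up for the example):

```javascript
// Check that `newKey` is a valid refinement of `oldKey`:
// same fields in the same order as a prefix, plus at least one suffix field.
// Key specs are plain objects, e.g. { customerId: 1, orderId: 1 }.
function isValidRefinement(oldKey, newKey) {
  const oldFields = Object.keys(oldKey);
  const newFields = Object.keys(newKey);
  if (newFields.length <= oldFields.length) return false; // must add something
  return oldFields.every((f, i) => newFields[i] === f && newKey[f] === oldKey[f]);
}

console.log(isValidRefinement({ customerId: 1 }, { customerId: 1, orderId: 1 })); // true
console.log(isValidRefinement({ customerId: 1 }, { orderId: 1, customerId: 1 })); // false
```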

Shard-key skew

Hot shard means your shard key concentrates writes onto one range. Test:

// Per-shard write rate: sample opcounters on each shard's primary
// (directly, via mongotop, or via your metrics scrape), then diff
// two samples to get ops/sec.
db.serverStatus().opcounters

// Compare ops/sec across shards. A > 2× spread = hot shard.
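The 2× rule against the median, as a standalone function — shard names and numbers are illustrative; feed it whatever per-shard ops/sec you derive from your scrape:

```javascript
// Flag hot shards: any shard whose ops/sec exceeds `factor` × the
// cluster median. Input: { shardName: opsPerSec, ... }.
function hotShards(opsPerShard, factor = 2) {
  const sorted = Object.values(opsPerShard).sort((a, b) => a - b);
  const mid = sorted.length / 2;
  const median = sorted.length % 2
    ? sorted[Math.floor(mid)]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return Object.entries(opsPerShard)
    .filter(([, ops]) => ops > factor * median)
    .map(([shard]) => shard);
}

console.log(hotShards({ "shard-01": 900, "shard-02": 1100, "shard-03": 4800, "shard-04": 1000 }));
// [ 'shard-03' ] — 4800 ops/sec vs a 1050 median
```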

Recovery options, in increasing intrusiveness: refine the shard key with suffix fields (4.4+), reshard to a hashed version of the existing key, or reshard to an entirely new key (online resharding is GA from 5.0). All require planning.

Alerts that matter

  • Chunk count spread > 30% across shards
  • Balancer failed migrations > 0 sustained 1h
  • Any chunk marked jumbo (immediate)
  • Per-shard ops/sec spread > 2× median
  • Per-shard storage spread > 2× median
  • Balancer round time > 5 min sustained

FAQ

How big should chunkSize be?
128 MB is the default. Raising to 256 MB reduces migration churn at the cost of bigger jumps when migrations do happen. Don't drop below 64 MB without good reason.
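If you do change it, the knob lives in the config database and the value is in MB. A mongosh fragment, run against a mongos (this is the documented config.settings mechanism; 256 is just the example value from above):

```js
// Against a mongos: set the cluster-wide chunk size to 256 MB.
// Takes effect for future splits and migrations, not retroactively.
db.getSiblingDB("config").settings.updateOne(
  { _id: "chunksize" },
  { $set: { value: 256 } },
  { upsert: true }
)
```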
Does Atlas Performance Advisor catch jumbo chunks?
No. PA is shard-local. Cluster-wide signals require explicit instrumentation.
Can I monitor sharding from the application?
Indirectly — track per-shard latency on a representative read. Drift on one shard signals balancer or capacity issues before the cluster-level metrics catch up.


Monitor your databases the way you monitor your services.

Book a 30-minute demo. We'll scope your database fleet together and quote your first 30-day engagement.
