Obsfly
[Cluster diagram: mongos router in front of 4 shards, 525 chunks total — shard-01 (rs0:27017) 142 chunks, shard-02 (rs1:27017) 138, shard-03 (rs2:27017) 156, shard-04 (rs3:27017) 89]


Sharded MongoDB Monitoring: The Key Metrics That Predict Imbalance

Chunk distribution, jumbo chunks, balancer round time, hot shards: these few metrics separate a healthy cluster from one that's about to need rebalancing.

Published · 12 min read

Sharded MongoDB is a different beast from a replica set. The dashboards you have show per-shard health; the failures usually come from between the shards. Imbalance, jumbo chunks, balancer livelock, and shard-key skew don’t show up on a per-shard graph. Here’s the cluster-level monitoring set you actually need.

On this page
  1. Cluster-level metrics
  2. Balancer health
  3. Jumbo chunks
  4. Shard-key skew
  5. Alerts
  6. FAQ

Cluster-level metrics

// On any mongos:
db.adminCommand({ balancerStatus: 1 })
sh.status()
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])

That last one is the one that surprises teams: the chunk count by shard. Healthy clusters are within ±5%; problematic ones are 30%+ apart and the balancer has given up.
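The ±5% / 30% thresholds are easy to check once you export those per-shard counts. A minimal sketch in plain Node.js — the input shape mirrors the aggregation output above, and the thresholds are this article's rules of thumb, not a MongoDB constant:

```javascript
// Given the output of the chunk-count aggregation above
// (one { _id: shard, chunks: count } doc per shard),
// compute the spread between the most- and least-loaded shard.
function chunkSpread(counts) {
  const values = counts.map((c) => c.chunks);
  const max = Math.max(...values);
  const min = Math.min(...values);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  // Spread as a fraction of the mean: ~0.05 is healthy,
  // 0.30+ means the balancer is losing (or has given up).
  return (max - min) / mean;
}

// The example cluster from the diagram: 142/138/156/89 chunks.
const counts = [
  { _id: "shard-01", chunks: 142 },
  { _id: "shard-02", chunks: 138 },
  { _id: "shard-03", chunks: 156 },
  { _id: "shard-04", chunks: 89 },
];
console.log(chunkSpread(counts).toFixed(2)); // "0.51" — well past the 30% line
```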

Balancer health

  • Round time — balancer rounds typically take seconds. Rounds taking minutes mean a single migration is stuck.
  • Migrations failed (last hour) — should be 0; sustained failures = chunk that won’t move (jumbo, write conflict, lock contention).
  • Active migrations — only 1 per source shard at a time; if 0 for 24h on an imbalanced cluster, balancer is paused or stuck.
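A watcher for these signals can be sketched as a pure function over two consecutive balancerStatus samples. `mode` and `numBalancerRounds` are real fields of the balancerStatus reply; the `imbalanced` flag is an assumption here, fed by your own chunk-spread check:

```javascript
// Classify balancer health from two consecutive samples of
// db.adminCommand({ balancerStatus: 1 }), taken e.g. 10 minutes apart.
// `imbalanced` is your own signal (e.g. chunk-count spread > 30%).
function balancerHealth(prev, curr, imbalanced) {
  if (curr.mode !== "full") return "paused"; // disabled via sh.stopBalancer()
  const progressed = curr.numBalancerRounds > prev.numBalancerRounds;
  if (!progressed && imbalanced) return "stuck"; // imbalanced, no rounds completing
  return "ok";
}

// Imbalanced cluster, and no round has completed since the last sample:
console.log(balancerHealth(
  { numBalancerRounds: 40 },
  { mode: "full", numBalancerRounds: 40 },
  true
)); // "stuck"
```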

Jumbo chunks

A chunk that exceeds chunkSize (default 128 MB since MongoDB 6.0; 64 MB before) and can’t be split is marked jumbo. The balancer refuses to migrate it. Find them:

db.getSiblingDB("config").chunks.find({ jumbo: true })
  .sort({ "min": 1 })
  .toArray()

The fix depends on the cause: a bad (low-cardinality) shard key — the most common, and not fixable without resharding — or a single shard-key value whose documents together exceed chunkSize (rare; the chunk can never split because every document shares that key value). On 4.4+, refineCollectionShardKey can help by adding suffix fields to make ranges splittable.
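refineCollectionShardKey only accepts a new key that keeps the existing key's fields, in order and with the same values, as a prefix — you can only append suffix fields. A pre-flight check for that constraint (the customerId/orderId field names are made up for the example):

```javascript
// Check that `newKey` is a valid refinement of `oldKey`:
// same fields in the same order as a prefix, plus at least one suffix field.
// Key specs are plain objects, e.g. { customerId: 1, orderId: 1 }.
function isValidRefinement(oldKey, newKey) {
  const oldFields = Object.keys(oldKey);
  const newFields = Object.keys(newKey);
  if (newFields.length <= oldFields.length) return false; // must add something
  return oldFields.every((f, i) => newFields[i] === f && newKey[f] === oldKey[f]);
}

console.log(isValidRefinement({ customerId: 1 }, { customerId: 1, orderId: 1 })); // true
console.log(isValidRefinement({ customerId: 1 }, { orderId: 1, customerId: 1 })); // false
```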

Shard-key skew

Hot shard means your shard key concentrates writes onto one range. Test:

// Per-shard write rate: sample opcounters on each shard's primary
// (directly, via mongotop, or via your metrics scrape), then diff
// two samples to get ops/sec.
db.serverStatus().opcounters

// Compare ops/sec across shards. A > 2× spread = hot shard.
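The 2× rule against the median, as a standalone function — shard names and numbers are illustrative; feed it whatever per-shard ops/sec you derive from your scrape:

```javascript
// Flag hot shards: any shard whose ops/sec exceeds `factor` × the
// cluster median. Input: { shardName: opsPerSec, ... }.
function hotShards(opsPerShard, factor = 2) {
  const sorted = Object.values(opsPerShard).sort((a, b) => a - b);
  const mid = sorted.length / 2;
  const median = sorted.length % 2
    ? sorted[Math.floor(mid)]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return Object.entries(opsPerShard)
    .filter(([, ops]) => ops > factor * median)
    .map(([shard]) => shard);
}

console.log(hotShards({ "shard-01": 900, "shard-02": 1100, "shard-03": 4800, "shard-04": 1000 }));
// [ 'shard-03' ] — 4800 ops/sec vs a 1050 median
```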

Recovery options, in increasing intrusiveness: refine the shard key with suffix fields (4.4+), reshard to a hashed version of the existing key, or reshard to an entirely new key (online resharding is GA from 5.0). All require planning.

Alerts that matter

  • Chunk count spread > 30% across shards
  • Balancer failed migrations > 0 sustained 1h
  • Any chunk marked jumbo (immediate)
  • Per-shard ops/sec spread > 2× median
  • Per-shard storage spread > 2× median
  • Balancer round time > 5 min sustained

FAQ

How big should chunkSize be?
128 MB is the default. Raising to 256 MB reduces migration churn at the cost of bigger jumps when migrations do happen. Don't drop below 64 MB without good reason.
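If you do change it, the knob lives in the config database and the value is in MB. A mongosh fragment, run against a mongos (this is the documented config.settings mechanism; 256 is just the example value from above):

```js
// Against a mongos: set the cluster-wide chunk size to 256 MB.
// Takes effect for future splits and migrations, not retroactively.
db.getSiblingDB("config").settings.updateOne(
  { _id: "chunksize" },
  { $set: { value: 256 } },
  { upsert: true }
)
```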
Does Atlas Performance Advisor catch jumbo chunks?
No. PA is shard-local. Cluster-wide signals require explicit instrumentation.
Can I monitor sharding from the application?
Indirectly — track per-shard latency on a representative read. Drift on one shard signals balancer or capacity issues before the cluster-level metrics catch up.


Monitor your databases the way you monitor your services.

Book a 30-minute demo. We'll scope your database fleet together and quote your first 30-day engagement.
