Monitoring sharded MongoDB: the metrics that predict imbalance
Chunk distribution, jumbo chunks, balancer round time, hot shards. The handful of metrics that separates a healthy cluster from one that's about to need rebalancing.
Sharded MongoDB is a different beast from a replica set. The dashboards you have show per-shard health; the failures usually come from between the shards. Imbalance, jumbo chunks, balancer livelock, and shard-key skew don’t show up on a per-shard graph. Here’s the cluster-level monitoring set you actually need.
Cluster-level metrics
// On any mongos:
db.adminCommand({ balancerStatus: 1 })
sh.status()
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])

That last one is the one that surprises teams: the chunk count by shard. Healthy clusters stay within ±5%; problematic ones are 30%+ apart, and the balancer has given up.
Balancer health
- Round time — balancer rounds typically take seconds. Rounds taking minutes mean a single migration is stuck.
- Migrations failed (last hour) — should be 0; sustained failures = chunk that won’t move (jumbo, write conflict, lock contention).
- Active migrations — only 1 per source shard at a time; if 0 for 24h on an imbalanced cluster, balancer is paused or stuck.
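The three checks above compose into one classification. A hedged sketch: the input shape below is this post's convention for numbers you'd scrape from `balancerStatus` and your own migration-failure counter, not something MongoDB returns in this form:

```javascript
// Classify balancer health from scraped numbers. Field names are
// this sketch's convention, not a MongoDB API.
function balancerHealth({ mode, roundSeconds, failedLastHour, activeMigrations, imbalanced }) {
  if (mode !== "full") return "paused";          // balancer disabled or windowed off
  if (failedLastHour > 0) return "stuck-chunk";  // jumbo, write conflict, lock contention
  if (roundSeconds > 300) return "slow-round";   // rounds should take seconds, not minutes
  if (imbalanced && activeMigrations === 0) return "stalled";
  return "ok";
}

console.log(balancerHealth({
  mode: "full", roundSeconds: 12, failedLastHour: 0,
  activeMigrations: 1, imbalanced: true,
})); // "ok" — imbalanced, but actively migrating
```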
Jumbo chunks
A chunk that exceeds chunkSize (default 128MB) and can’t be split is marked jumbo. The balancer refuses to migrate it. Find them:
db.getSiblingDB("config").chunks.find({ jumbo: true })
  .sort({ min: 1 })
  .toArray()

The fix depends on the cause: a bad shard key (the most common — can't be fixed without resharding), or a single document larger than the chunk size (rare; consider splitting the document). On 4.4+, refineCollectionShardKey can help by adding suffix fields to make ranges splittable.
Shard-key skew
Hot shard means your shard key concentrates writes onto one range. Test:
// Top 10 shard keys by write rate over the last hour:
// db.serverStatus().opcounters per shard, computed via mongotop
// or collected via a metrics scrape.
// Compare ops/sec across shards. > 2× spread = hot shard.
Recovery options, in order of increasing intrusiveness: hash the existing key, refine the shard key, or reshard (online resharding is GA since 5.0). All require planning.
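The "> 2× spread" test is cheap to automate once you have per-shard ops/sec from your scrape. A sketch under those assumptions — shard names and rates below are made up:

```javascript
// Flag shards whose ops/sec exceed a multiple of the cluster median.
function hotShards(opsByShard, factor = 2) {
  const vals = Object.values(opsByShard).slice().sort((a, b) => a - b);
  const mid = vals.length % 2
    ? vals[(vals.length - 1) / 2]
    : (vals[vals.length / 2 - 1] + vals[vals.length / 2]) / 2;
  return Object.entries(opsByShard)
    .filter(([, ops]) => ops > factor * mid)
    .map(([shard]) => shard);
}

// shard2 takes ~5× the median write rate: classic shard-key skew.
console.log(hotShards({ shard0: 900, shard1: 1100, shard2: 5200 })); // [ 'shard2' ]
```

Comparing against the median rather than the mean matters here: one very hot shard drags the mean up and can hide itself from a mean-based threshold.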
Alerts that matter
- Chunk count spread > 30% across shards
- Balancer failed migrations > 0 sustained 1h
- Any chunk marked jumbo (immediate)
- Per-shard ops/sec spread > 2× median
- Per-shard storage spread > 2× median
- Balancer round time > 5 min sustained