Production Redis Monitoring: A 2026 Guide
INFO, slowlog, latency monitor, keyspace notifications, big-key sampling: what to collect, and the eight metrics that predict every incident.
Redis is fast enough that teams ship it with default monitoring and forget about it for years. Then one Saturday it’s slow and you have ten minutes to work out why, with more than 200 fields in INFO to choose from. This post gives you a working set of probes for production Redis.
The five monitoring surfaces
- INFO: 13 sections, ~200 fields. Scraped at 1Hz it’s your primary feed.
- SLOWLOG: command-level slow log. Tune length + threshold; otherwise the default 128 entries is useless.
- LATENCY: subsystem-level latency events. Detects fork/AOF/expire-cycle pauses INFO can’t.
- CLIENT LIST: who’s connected, their input/output buffer sizes. Catches runaway pipelines.
- --bigkeys / --hotkeys: sampled scans for working-set anomalies. Run weekly, not at incident time.
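As a rough illustration of the CLIENT LIST surface, here is a minimal sketch that parses the command's text output (one client per line, space-separated `key=value` pairs) and flags clients with large query or output buffers. The 1 MB limits, the function names and the sample text are assumptions for illustration, not recommendations from this post.

```python
def parse_client_list(raw):
    """Parse CLIENT LIST output into one dict of string fields per client."""
    clients = []
    for line in raw.splitlines():
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        if fields:
            clients.append(fields)
    return clients


def heavy_clients(raw, qbuf_limit=1_000_000, omem_limit=1_000_000):
    """Return (addr, qbuf, omem) for clients over either (assumed) limit."""
    flagged = []
    for client in parse_client_list(raw):
        qbuf = int(client.get("qbuf", 0))   # query (input) buffer length
        omem = int(client.get("omem", 0))   # output buffer memory usage
        if qbuf > qbuf_limit or omem > omem_limit:
            flagged.append((client.get("addr", "?"), qbuf, omem))
    return flagged


# Hand-written sample in the CLIENT LIST format (abbreviated field set).
SAMPLE_CLIENTS = (
    "id=3 addr=10.0.0.5:51234 name= age=120 idle=0 db=0 qbuf=26 obl=0 oll=0 omem=0 cmd=get\n"
    "id=7 addr=10.0.0.9:40022 name=batch age=900 idle=0 db=0 qbuf=0 obl=0 oll=412 omem=8437760 cmd=lrange\n"
)

if __name__ == "__main__":
    for addr, qbuf, omem in heavy_clients(SAMPLE_CLIENTS):
        print(f"{addr} qbuf={qbuf} omem={omem}")
```

Feeding it the live output of CLIENT LIST on a schedule gives a cheap way to spot a runaway pipeline before it exhausts memory.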
INFO: what to actually scrape
You don’t need 200 fields. The 14 that predict everything:
| Field | Why it matters |
|---|---|
| used_memory_rss | Resident memory. If used_memory_rss / used_memory > 1.5, you have fragmentation. |
| mem_fragmentation_ratio | Same thing pre-computed. Alert > 1.5 sustained. |
| connected_clients | Should be flat. Trend up = leak. |
| blocked_clients | Clients waiting in blocking calls such as BLPOP / WAIT. > 10 sustained = saturating. |
| instantaneous_ops_per_sec | QPS. Combined with latency, your top-line. |
| keyspace_hits / keyspace_misses | Hit ratio. Falling = working set exceeds memory. |
| evicted_keys | Should be 0 unless maxmemory is set with an eviction policy. Spike = evicting hot data. |
| expired_keys | Trend up = TTL pile-up. Combined with CPU spike = expire cycle stall. |
| master_link_status | On a replica, the state of the link to its master: up or down. down = replication is broken. |
| master_repl_offset | Replication offset. Stalled = drift. |
| aof_current_size | AOF growth. Grows faster than rewrite = bg rewrite stuck. |
| rdb_bgsave_in_progress | 1 = a background save is running in a forked child. Watch for fork-time spikes on big instances. |
| latest_fork_usec | Last fork duration in microseconds. > 500000 (500ms) on a big instance = problem. |
| cluster_state | Cluster mode only; reported by CLUSTER INFO rather than INFO. 'fail' = the cluster cannot serve all slots, typically because a node is missing. Page immediately. |
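To make the scrape concrete, here is a minimal sketch that parses raw INFO text (`key:value` lines, with `# Section` headers) and derives the ratios the table refers to. The function names and the sample values are made up for illustration; a real scraper would read the text from a live connection.

```python
def parse_info(raw):
    """Parse INFO output into a dict of string values, skipping section headers."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def derived_metrics(info):
    """Compute hit ratio, RSS / used_memory, and last fork time in milliseconds."""
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    lookups = hits + misses
    used = int(info["used_memory"])
    return {
        "hit_ratio": hits / lookups if lookups else None,
        "rss_ratio": int(info["used_memory_rss"]) / used if used else None,
        "fork_ms": int(info["latest_fork_usec"]) / 1000,
    }


# Hand-written sample; real INFO output has many more fields.
SAMPLE_INFO = """\
# Memory
used_memory:1000000
used_memory_rss:1600000
# Stats
keyspace_hits:900
keyspace_misses:100
latest_fork_usec:250000
"""

if __name__ == "__main__":
    print(derived_metrics(parse_info(SAMPLE_INFO)))
```

Note that keyspace_hits and keyspace_misses are counters accumulated since startup, so a ratio computed from one snapshot is a lifetime average. For alerting, it is usually more useful to compute the ratio from the difference between two consecutive scrapes.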
SLOWLOG: tune it once and you have a record
Defaults are 10ms threshold and 128-entry ring buffer. On a busy instance the 128 entries cycle in seconds — useless. Set:
    # Capture commands slower than 1ms; keep 1024 of them
    CONFIG SET slowlog-log-slower-than 1000
    CONFIG SET slowlog-max-len 1024

    # Read the most recent N
    SLOWLOG GET 50

    # Reset after analysis
    SLOWLOG RESET
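Once the log holds a useful number of entries, grouping them by command name usually shows the culprit faster than reading them one by one. The sketch below assumes the entries have already been fetched and converted into a simple `SlowEntry` record (a hypothetical type for this example, holding the timestamp, duration in microseconds, and argument list that SLOWLOG GET reports for each entry).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SlowEntry:
    timestamp: int      # Unix time at which the command was logged
    duration_usec: int  # execution time in microseconds
    args: list          # command name followed by its arguments


def summarize(entries):
    """Group slow entries by command name, sorted by total time spent."""
    stats = defaultdict(lambda: {"count": 0, "total_usec": 0, "max_usec": 0})
    for entry in entries:
        name = entry.args[0].upper() if entry.args else "?"
        s = stats[name]
        s["count"] += 1
        s["total_usec"] += entry.duration_usec
        s["max_usec"] = max(s["max_usec"], entry.duration_usec)
    return sorted(stats.items(), key=lambda kv: kv[1]["total_usec"], reverse=True)


# Made-up entries for illustration.
SAMPLE_ENTRIES = [
    SlowEntry(1700000000, 1500, ["KEYS", "*"]),
    SlowEntry(1700000001, 52000, ["keys", "user:*"]),
    SlowEntry(1700000002, 2100, ["HGETALL", "cart:42"]),
]

if __name__ == "__main__":
    for name, s in summarize(SAMPLE_ENTRIES):
        print(name, s["count"], s["total_usec"], s["max_usec"])
```

Sorting by total time rather than by count keeps a single pathological command from hiding behind many merely slowish ones.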
LATENCY monitor: catches what SLOWLOG can’t
Different beast. SLOWLOG records commands; LATENCY records events such as fast-command, fork, aof-stat and expire-cycle. Enable with CONFIG SET latency-monitor-threshold 100 (the threshold is in milliseconds); query with LATENCY HISTORY fork.
A common production incident: AOF rewrite forks, fork takes 1.5s on a 30 GB instance, every command during that 1.5s sees the latency. SLOWLOG shows thousands of slow commands but no obvious cause; LATENCY shows the single fork event that’s the actual story.
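One way to connect the two views is to line up latency events against slow-command timestamps. The sketch below takes LATENCY HISTORY samples as `(unix_timestamp, latency_ms)` pairs, which is the shape the command reports, and counts how many slow commands were logged within a few seconds of each event. The two-second window and the sample numbers are assumptions for illustration.

```python
def slow_commands_near_events(events, slow_timestamps, window_s=2):
    """Map each latency event timestamp to (latency_ms, nearby slow-command count).

    events: iterable of (unix_timestamp, latency_ms) pairs.
    slow_timestamps: Unix timestamps of slow-log entries.
    window_s: assumed matching window, in seconds, on either side of the event.
    """
    result = {}
    for ts, latency_ms in events:
        nearby = sum(1 for t in slow_timestamps if abs(t - ts) <= window_s)
        result[ts] = (latency_ms, nearby)
    return result


# Made-up data: one 1.5 s fork event and five slow-log timestamps.
FORK_EVENTS = [(1700000100, 1500)]
SLOW_TIMESTAMPS = [1700000050, 1700000099, 1700000100, 1700000101, 1700000300]

if __name__ == "__main__":
    for ts, (latency_ms, count) in slow_commands_near_events(
        FORK_EVENTS, SLOW_TIMESTAMPS
    ).items():
        print(f"event at {ts}: {latency_ms} ms, {count} slow commands nearby")
```

A cluster of slow commands around one fork event points at the fork, not at the commands themselves.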
Memory fragmentation: the silent killer
Redis allocates with jemalloc. Long-lived workloads with mixed key sizes fragment over weeks until used_memory_rss is 2× used_memory and you OOM your host. Two early-warning signals:
- mem_fragmentation_ratio drift > 1.4 sustained for 24h.
- active_defrag_running = 1 most of the time means defrag can’t keep up; raise active-defrag-cycle-max.
Replication and cluster monitoring
For Sentinel: alert on +sdown and +odown events. For Cluster:
- Alert on cluster_state != ok immediately.
- Track cluster_slots_assigned; should equal 16384 always.
- Watch cluster_known_nodes; an unexpected drop means a node was forgotten.
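The three cluster checks can be sketched as a single function over the text that CLUSTER INFO returns (`key:value` lines). The function name and the idea of passing in the previously observed node count are assumptions for this example; how you persist the previous value is up to your collector.

```python
EXPECTED_SLOTS = 16384


def check_cluster(raw, previous_known_nodes=None):
    """Return a list of problem descriptions found in CLUSTER INFO output."""
    info = dict(
        line.strip().split(":", 1) for line in raw.splitlines() if ":" in line
    )
    problems = []
    state = info.get("cluster_state")
    if state != "ok":
        problems.append(f"cluster_state is {state}")
    slots = int(info.get("cluster_slots_assigned", 0))
    if slots != EXPECTED_SLOTS:
        problems.append(f"cluster_slots_assigned is {slots}")
    known = int(info.get("cluster_known_nodes", 0))
    if previous_known_nodes is not None and known < previous_known_nodes:
        problems.append(
            f"cluster_known_nodes dropped from {previous_known_nodes} to {known}"
        )
    return problems


# Hand-written samples in the CLUSTER INFO format.
HEALTHY = "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_known_nodes:6\r\n"
DEGRADED = "cluster_state:fail\r\ncluster_slots_assigned:16000\r\ncluster_known_nodes:5\r\n"

if __name__ == "__main__":
    print(check_cluster(HEALTHY, previous_known_nodes=6))
    print(check_cluster(DEGRADED, previous_known_nodes=6))
```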
Eight alerts every Redis deployment should ship with
- used_memory / maxmemory > 80% for 5m
- mem_fragmentation_ratio > 1.5 for 1h
- evicted_keys delta > 0 for instances that should not be evicting
- master_link_status = down for 10s on a replica
- latest_fork_usec > 500000 (bg-saves are pausing the world)
- blocked_clients > 50 sustained
- connected_clients trending up > 10% / day
- cluster_state != ok (immediate)
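Several of these alerts depend on the change between two scrapes, not on a single value. As a rough sketch, the function below compares two INFO snapshots (already parsed into dicts of integers) and evaluates the memory, eviction, fork, blocked-client and client-growth rules. The function name, the alert labels and the sample numbers are assumptions for illustration, and the "for 5m" style durations are left out for brevity.

```python
SECONDS_PER_DAY = 86400


def snapshot_alerts(prev, curr, elapsed_s):
    """Evaluate delta- and ratio-based rules between two INFO snapshots."""
    alerts = []
    maxmemory = curr.get("maxmemory", 0)
    # maxmemory of 0 means no limit is configured, so the ratio is undefined.
    if maxmemory and curr["used_memory"] / maxmemory > 0.80:
        alerts.append("memory")
    # Counters reset on restart; only a positive delta counts as new evictions.
    if curr["evicted_keys"] - prev["evicted_keys"] > 0:
        alerts.append("evictions")
    if curr["latest_fork_usec"] > 500_000:
        alerts.append("fork")
    if curr["blocked_clients"] > 50:
        alerts.append("blocked")
    if prev["connected_clients"] > 0 and elapsed_s > 0:
        growth = (curr["connected_clients"] - prev["connected_clients"]) / prev[
            "connected_clients"
        ]
        if growth * SECONDS_PER_DAY / elapsed_s > 0.10:
            alerts.append("client_growth")
    return alerts


# Made-up snapshots taken 12 hours apart.
PREV = {"used_memory": 700, "maxmemory": 1000, "evicted_keys": 10,
        "connected_clients": 100, "blocked_clients": 0, "latest_fork_usec": 1000}
CURR = {"used_memory": 850, "maxmemory": 1000, "evicted_keys": 12,
        "connected_clients": 106, "blocked_clients": 2, "latest_fork_usec": 650000}

if __name__ == "__main__":
    print(snapshot_alerts(PREV, CURR, elapsed_s=12 * 3600))
```

The client-growth rule extrapolates the observed change linearly to a per-day rate, which is a simplification; a longer baseline would be less noisy.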