Redis monitoring in production: the 2026 guide

INFO, slowlog, latency monitor, keyspace notifications, big-key sampling: what to scrape, and the eight metrics that predict every Redis incident.

Published 2026-04-26·12 min read

Redis is fast enough that teams ship it with default monitoring and forget about it for years. Then one Saturday it’s slow and you have ten minutes to work out why from the 200-plus fields in INFO. This post gives you a working set of probes for Redis in production.

The five monitoring surfaces

INFO: 13 sections, ~200 fields. Scraped at 1Hz it’s your primary feed.
SLOWLOG: command-level slow log. Tune both the length and the threshold; with the default 128 entries it is useless on a busy instance.
LATENCY: subsystem-level latency events. Detects fork/AOF/expire-cycle pauses INFO can’t.
CLIENT LIST: who’s connected, their input/output buffer sizes. Catches runaway pipelines.
--bigkeys / --hotkeys: sampled scans (redis-cli flags) for working-set anomalies. Run weekly, not at incident time. --hotkeys needs an LFU maxmemory-policy.

INFO: what to actually scrape

You don’t need 200 fields. These 14 cover most incidents:

Field	Why it matters
`used_memory_rss`	Resident memory. If RSS / used_memory > 1.5, you have fragmentation.
`mem_fragmentation_ratio`	Same thing pre-computed. Alert > 1.5 sustained.
`connected_clients`	Should be flat. Trend up = leak.
`blocked_clients`	Watching BLPOP / WAIT. > 10 sustained = saturating.
`instantaneous_ops_per_sec`	QPS. Combined with latency, your top-line.
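`keyspace_hits` / `keyspace_misses`	Hit ratio = hits / (hits + misses). Falling = working set exceeds memory.
`evicted_keys`	Cumulative counter; watch the delta. Should stay flat unless you set maxmemory. Spike = evicting hot data.
`expired_keys`	Cumulative counter; a rising rate = TTL pile-up. Combined with a CPU spike = expire cycle stall.
`master_link_status`	On replicas: up or down. Down = the replica has lost its link to the master.
`master_repl_offset`	Replication offset. Stalled on a replica while the master’s advances = drift.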
`aof_current_size`	AOF growth. Grows faster than rewrite = bg rewrite stuck.
`rdb_bgsave_in_progress`	1 = forking. Watch for fork-time spikes on big instances.
`latest_fork_usec`	Last fork duration in microseconds. > 500000 (500 ms) on a big instance = problem.
`cluster_state`	Cluster mode only; reported by CLUSTER INFO rather than INFO. 'fail' = node missing or slots uncovered. Page immediately.
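
To make the table concrete, here is a minimal sketch of turning raw INFO text into the two derived ratios it mentions. The function names and the SAMPLE payload are illustrative assumptions, not part of any client library; in practice the raw text would come from whatever client you already use.

```python
def parse_info(raw: str) -> dict:
    """Parse the text returned by INFO into a flat {field: value} dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def derived_metrics(info: dict) -> dict:
    """Compute hit ratio and RSS / used_memory from parsed INFO fields."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    used = int(info.get("used_memory", 0))
    rss = int(info.get("used_memory_rss", 0))
    lookups = hits + misses
    return {
        # Assumption: with no lookups yet, report 1.0 so a cold instance
        # does not trip a hit-ratio alert.
        "hit_ratio": hits / lookups if lookups else 1.0,
        "rss_over_used": rss / used if used else 0.0,
    }


# Hypothetical, heavily trimmed INFO payload for illustration only.
SAMPLE = """# Memory
used_memory:1000000
used_memory_rss:1600000
# Stats
keyspace_hits:900
keyspace_misses:100
"""

if __name__ == "__main__":
    print(derived_metrics(parse_info(SAMPLE)))
```

Counters such as keyspace_hits are cumulative since startup, so a scraper would normally compute these ratios over the delta between two scrapes rather than over the lifetime totals shown here.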

SLOWLOG: tune it once and you have a record

The defaults are a 10ms threshold (slowlog-log-slower-than 10000, in microseconds) and a 128-entry ring buffer (slowlog-max-len 128). On a busy instance the 128 entries cycle in seconds, which makes the log useless. Set:

# Capture commands slower than 1 ms (the threshold is in microseconds); keep 1024 of them
CONFIG SET slowlog-log-slower-than 1000
CONFIG SET slowlog-max-len 1024

# Read the most recent N
SLOWLOG GET 50

# Reset after analysis
SLOWLOG RESET
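
Once the log holds a useful number of entries, grouping them by command is usually the first step of the analysis. The sketch below assumes each entry is a dict with a 'duration' in microseconds and a 'command' string or bytes, which is roughly the shape redis-py's slowlog_get() returns; treat that shape as an assumption and adapt it to your client.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slow log entries by command name, worst total time first.

    Each entry is assumed to be a dict with:
      'duration': execution time in microseconds
      'command':  the full command as str or bytes, e.g. "KEYS user:*"
    """
    stats = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        name = parts[0].upper() if parts else "(empty)"
        s = stats[name]
        s["count"] += 1
        s["total_us"] += entry["duration"]
        s["max_us"] = max(s["max_us"], entry["duration"])
    # Sort by total time spent so the biggest offender comes first.
    return sorted(stats.items(), key=lambda kv: kv[1]["total_us"], reverse=True)


if __name__ == "__main__":
    # Hypothetical entries for illustration only.
    sample = [
        {"duration": 12000, "command": b"KEYS user:*"},
        {"duration": 1500, "command": "hgetall cart:42"},
        {"duration": 30000, "command": "KEYS session:*"},
    ]
    for name, s in summarize_slowlog(sample):
        print(name, s)
```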

LATENCY monitor: catches what SLOWLOG can’t

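Different beast. SLOWLOG records individual commands; LATENCY records events by source, such as fast-command, fork, aof-stat, and expire-cycle. Enable it with CONFIG SET latency-monitor-threshold 100 (the threshold is in milliseconds; 0, the default, disables the monitor); query with LATENCY HISTORY fork.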

A common production incident: an AOF rewrite forks, the fork takes 1.5s on a 30 GB instance, and every command that arrives during that 1.5s waits behind it. Clients report thousands of slow requests, but SLOWLOG times only command execution, so it shows no obvious cause; LATENCY shows the single fork event that is the actual story.
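
One way to confirm that story during a postmortem is to line up the slow requests your clients logged against the events LATENCY HISTORY reports. The sketch below assumes events arrive as (unix timestamp, latency in ms) pairs, which is the shape LATENCY HISTORY returns; the function name, the slack_s parameter, and the symmetric window are assumptions made for illustration.

```python
import math


def attribute_slow_requests(slow_request_times, events, slack_s=1):
    """Count client-side slow requests that fall near each latency event.

    slow_request_times: unix timestamps (seconds) from your own client logs.
    events: (unix_ts, latency_ms) pairs, as returned by LATENCY HISTORY.
    slack_s: hypothetical tolerance for clock skew between client and server.
    """
    result = {}
    for event_ts, latency_ms in events:
        # Event timestamps have one-second resolution, so widen the window
        # by the event duration plus the slack on both sides.
        half_width = math.ceil(latency_ms / 1000) + slack_s
        lo, hi = event_ts - half_width, event_ts + half_width
        result[event_ts] = sum(1 for t in slow_request_times if lo <= t <= hi)
    return result


if __name__ == "__main__":
    # Hypothetical data: one 1.5 s fork event and six slow client requests.
    fork_events = [(1000, 1500)]
    slow = [900, 998, 1000, 1001, 1003, 1010]
    print(attribute_slow_requests(slow, fork_events))
```

If most slow requests cluster around a handful of fork events, the fix lives in persistence settings or instance size rather than in the commands themselves.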

Memory fragmentation: the silent killer

Redis allocates with jemalloc. Long-lived workloads with mixed key sizes fragment over weeks until used_memory_rss is 2× used_memory and you OOM your host. Two early-warning signals:

mem_fragmentation_ratio drifting above 1.4 and staying there for 24h.
active_defrag_running above 0 most of the time (the field reports the CPU percentage defrag intends to use, 0 when idle) means defrag can’t keep up; raise active-defrag-cycle-max.
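
The first signal is a "sustained for a window" check rather than a point-in-time threshold. A minimal sketch, assuming you already store (unix timestamp, mem_fragmentation_ratio) samples somewhere; the function name and parameters are hypothetical.

```python
def sustained_above(samples, threshold, window_s):
    """Return True if every sample in the last window_s seconds exceeds threshold.

    samples: list of (unix_ts, value) tuples sorted by ascending timestamp.
    Returns False when there is not yet a full window of history, so a
    freshly started collector cannot fire the alert.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start = end - window_s
    if samples[0][0] > start:
        return False  # history does not cover the whole window
    recent = [value for ts, value in samples if ts >= start]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    DAY = 24 * 3600
    # Hypothetical hourly samples covering 25 hours, all at ratio 1.45.
    hourly = [(hour * 3600, 1.45) for hour in range(26)]
    print(sustained_above(hourly, 1.4, DAY))
```

Requiring every sample to exceed the threshold is deliberately strict; a single dip resets the condition, which keeps a brief post-restart improvement from being masked.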

Replication and cluster monitoring

For Sentinel: alert on +sdown and +odown events. For Cluster:

Alert on cluster_state != ok immediately.
Track cluster_slots_assigned; should equal 16384 always.
Watch cluster_known_nodes; an unexpected drop means a node was forgotten.
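
The three cluster checks above can be sketched as a single function over the text CLUSTER INFO returns. The expected_nodes parameter is a hypothetical input you would supply from your own inventory; nothing in Redis provides it.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def cluster_problems(raw, expected_nodes):
    """Return a list of problems found in CLUSTER INFO output.

    raw: the text returned by CLUSTER INFO ("field:value" lines).
    expected_nodes: node count from your own inventory (an assumption).
    """
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value

    problems = []
    state = fields.get("cluster_state")
    if state != "ok":
        problems.append(f"cluster_state is {state}")
    assigned = int(fields.get("cluster_slots_assigned", 0))
    if assigned != TOTAL_SLOTS:
        problems.append(f"only {assigned} of {TOTAL_SLOTS} slots assigned")
    known = int(fields.get("cluster_known_nodes", 0))
    if known < expected_nodes:
        problems.append(f"{known} known nodes, expected {expected_nodes}")
    return problems


if __name__ == "__main__":
    # Hypothetical CLUSTER INFO excerpt for illustration only.
    degraded = "cluster_state:fail\r\ncluster_slots_assigned:10923\r\ncluster_known_nodes:5\r\n"
    for problem in cluster_problems(degraded, expected_nodes=6):
        print(problem)
```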

Eight alerts every Redis deployment should ship with

used_memory / maxmemory > 80% for 5m
mem_fragmentation_ratio > 1.5 for 1h
evicted_keys delta > 0 for instances that should not be evicting
master_link_status = down for 10s on a replica
latest_fork_usec > 500000 (bg-saves are pausing the world)
blocked_clients > 50 sustained
connected_clients trending up > 10% / day
cluster_state != ok (immediate)
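
As a rough sketch, six of these rules can be evaluated from two consecutive INFO snapshots. The function and alert names are illustrative assumptions. The "for 5m" and "for 1h" durations are left to whatever alerting system consumes the output, and the connected_clients trend and cluster_state rules are omitted because they need longer history and CLUSTER INFO respectively.

```python
def instant_alerts(prev, curr, may_evict=False):
    """Evaluate point-in-time alert conditions from two INFO snapshots.

    prev, curr: dicts of INFO fields (string values), older and newer scrape.
    may_evict: set True for cache instances where eviction is expected.
    """
    alerts = []

    maxmemory = int(curr.get("maxmemory", 0))
    # maxmemory of 0 means no limit is configured, so the ratio is undefined.
    if maxmemory and int(curr.get("used_memory", 0)) / maxmemory > 0.80:
        alerts.append("memory_above_80pct")

    if float(curr.get("mem_fragmentation_ratio", 0)) > 1.5:
        alerts.append("fragmentation_above_1.5")

    # evicted_keys is cumulative, so compare against the previous scrape.
    evicted_delta = int(curr.get("evicted_keys", 0)) - int(prev.get("evicted_keys", 0))
    if not may_evict and evicted_delta > 0:
        alerts.append("unexpected_evictions")

    if curr.get("master_link_status") == "down":
        alerts.append("replica_link_down")

    if int(curr.get("latest_fork_usec", 0)) > 500_000:
        alerts.append("slow_fork")

    if int(curr.get("blocked_clients", 0)) > 50:
        alerts.append("blocked_clients_high")

    return alerts


if __name__ == "__main__":
    # Hypothetical snapshots for illustration only.
    before = {"evicted_keys": "10"}
    after = {"used_memory": "900", "maxmemory": "1000", "evicted_keys": "15"}
    print(instant_alerts(before, after))
```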

FAQ

Should I use Prometheus's redis_exporter?

Fine for ~30 instances. At scale the per-instance HTTP scrape adds up; prefer an agent that reads INFO once and emits all metrics in one push.

What about Redis Sentinel monitoring?

Subscribe to the +sdown / +odown / +switch-master pub/sub channels. Don't poll — events are the source of truth.
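
Whatever client delivers those messages, the payloads still need parsing before they can be routed to an alert. A minimal sketch, assuming the payload layouts described in the Sentinel documentation: instance details for +sdown and +odown, and "name old-ip old-port new-ip new-port" for +switch-master. The function name and the returned dict shape are hypothetical.

```python
def parse_sentinel_event(channel, payload):
    """Parse a Sentinel pub/sub message into a dict.

    Assumed payload layouts:
      +sdown / +odown: <type> <name> <ip> <port> [@ <master> <ip> <port>]
      +switch-master:  <master> <old-ip> <old-port> <new-ip> <new-port>
    """
    parts = payload.split()
    if channel == "+switch-master":
        if len(parts) != 5:
            raise ValueError(f"unexpected payload: {payload!r}")
        name, old_ip, old_port, new_ip, new_port = parts
        return {
            "event": channel,
            "master": name,
            "old": (old_ip, int(old_port)),
            "new": (new_ip, int(new_port)),
        }
    if channel in ("+sdown", "+odown"):
        if len(parts) < 4:
            raise ValueError(f"unexpected payload: {payload!r}")
        kind, name, ip, port = parts[:4]
        event = {"event": channel, "type": kind, "name": name, "addr": (ip, int(port))}
        # The "@ <master> ..." suffix is present when the instance is not
        # the master itself.
        if len(parts) >= 8 and parts[4] == "@":
            event["master"] = parts[5]
        else:
            event["master"] = name if kind == "master" else None
        return event
    raise ValueError(f"unhandled channel: {channel}")


if __name__ == "__main__":
    print(parse_sentinel_event("+switch-master", "mymaster 10.0.0.1 6379 10.0.0.2 6379"))
```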

Active-defrag: safe to enable?

Yes on Redis 7+. The cycle parameters keep it bounded; the defaults (active-defrag-cycle-min 1, active-defrag-cycle-max 25, i.e. 1-25% CPU) are conservative.
