Obsfly

Redis Monitoring in Production: The 2026 Guide

INFO, slowlog, latency monitor, keyspace notifications, big-key sampling: what to scrape, and the eight metrics that predict every Redis incident.

Published · 12 min read

Redis is fast enough that teams ship it with default monitoring and forget about it for years. Then one Saturday it’s slow and you have ten minutes to work out why, with more than 200 INFO fields to sift through. This post gives you a working set of probes for Redis in production.

On this page
  1. The five surfaces
  2. INFO: what to scrape
  3. SLOWLOG: tune it once
  4. Latency monitor
  5. Memory fragmentation
  6. Replication & cluster
  7. Eight alerts that matter
  8. FAQ

The five monitoring surfaces

  • INFO: 13 sections, ~200 fields. Scraped at 1 Hz, it’s your primary feed.
  • SLOWLOG: command-level slow log. Tune length + threshold; otherwise the default of 128 entries is useless.
  • LATENCY: subsystem-level latency events. Detects fork/AOF/expire-cycle pauses INFO can’t.
  • CLIENT LIST: who’s connected and their input/output buffer sizes. Catches runaway pipelines.
  • redis-cli --bigkeys / --hotkeys: sampled scans for working-set anomalies (--hotkeys requires an LFU maxmemory-policy). Run weekly, not at incident time.

INFO: what to actually scrape

You don’t need 200 fields. The 14 that predict everything:

Field | Why it matters
used_memory_rss | Resident memory. If used_memory_rss / used_memory > 1.5, you have fragmentation.
mem_fragmentation_ratio | The same ratio, pre-computed. Alert on > 1.5 sustained.
connected_clients | Should be flat. Trending up = connection leak.
blocked_clients | Clients blocked on BLPOP / WAIT. > 10 sustained = saturating.
instantaneous_ops_per_sec | QPS. Combined with latency, your top-line signal.
keyspace_hits / keyspace_misses | Hit ratio. Falling = working set exceeds memory.
evicted_keys | Should be 0 unless you set maxmemory. Spike = evicting hot data.
expired_keys | Trending up = TTL pile-up. Combined with a CPU spike = expire-cycle stall.
master_link_status | A replica's link to its master. 'down' = replication is broken.
master_repl_offset | Replication offset. Stalled = drift.
aof_current_size | AOF growth. Grows faster than rewrite = bg rewrite stuck.
rdb_bgsave_in_progress | 1 = forking. Watch for fork-time spikes on big instances.
latest_fork_usec | Last fork duration. > 500ms on a big instance = problem.
cluster_state | Cluster mode only; reported by CLUSTER INFO rather than INFO. 'fail' = node missing. Page immediately.
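
To make the scraping step concrete, here is a minimal sketch of turning raw INFO output into a dict and deriving a hit ratio from two consecutive scrapes. The function names (parse_info, hit_ratio) are illustrative, not part of any Redis client; the only thing assumed about the input is the "field:value" line format with "# Section" headers that INFO prints.

```python
from typing import Optional


def parse_info(text: str) -> dict:
    """Parse raw INFO output into a flat {field: value} dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def hit_ratio(prev: dict, curr: dict) -> Optional[float]:
    """Hit ratio over the interval between two scrapes.

    keyspace_hits and keyspace_misses are cumulative counters, so the
    ratio is computed from their deltas rather than the raw values.
    Returns None when there was no traffic or a counter went backwards
    (for example after a restart).
    """
    hits = int(curr["keyspace_hits"]) - int(prev["keyspace_hits"])
    misses = int(curr["keyspace_misses"]) - int(prev["keyspace_misses"])
    if hits < 0 or misses < 0 or hits + misses == 0:
        return None
    return hits / (hits + misses)


# Hypothetical scrape fragments, shaped like real INFO output.
SAMPLE_PREV = "# Stats\r\nkeyspace_hits:1000\r\nkeyspace_misses:100\r\n"
SAMPLE_CURR = (
    "# Memory\r\nused_memory:1000000\r\nused_memory_rss:1600000\r\n"
    "# Stats\r\nkeyspace_hits:1900\r\nkeyspace_misses:200\r\n"
)
```

Computing the ratio from deltas matters: the lifetime ratio of a long-running instance barely moves even when the last minute was almost all misses.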

SLOWLOG: tune it once and you have a record

The defaults are a 10 ms threshold (slowlog-log-slower-than 10000, in microseconds) and a 128-entry ring buffer. On a busy instance those 128 entries cycle in seconds, which makes them useless. Set:

# Capture commands slower than 1ms; keep 1024 of them
CONFIG SET slowlog-log-slower-than 1000
CONFIG SET slowlog-max-len 1024

# Read the most recent N
SLOWLOG GET 50

# Reset after analysis
SLOWLOG RESET
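
Once the log holds a useful number of entries, a per-command rollup is usually more readable than the raw list. The sketch below assumes each entry has already been fetched and shaped as a dict with a 'command' string and a 'duration' in microseconds; that shape and the function name are assumptions for illustration, so adapt them to whatever your client returns.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slow-log entries by command name.

    entries: iterable of dicts with 'command' (full command string) and
    'duration' (microseconds). This input shape is an assumption.
    Returns a list of (name, count, total_usec, max_usec) tuples, sorted
    by total time spent, worst first.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # count, total_usec, max_usec
    for entry in entries:
        parts = entry["command"].split()
        name = parts[0].upper() if parts else "(empty)"
        s = stats[name]
        s[0] += 1
        s[1] += entry["duration"]
        s[2] = max(s[2], entry["duration"])
    rows = [(name, c, t, m) for name, (c, t, m) in stats.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical entries for illustration.
EXAMPLE_ENTRIES = [
    {"command": "KEYS *", "duration": 52000},
    {"command": "hgetall user:1", "duration": 1500},
    {"command": "HGETALL user:2", "duration": 2500},
]
```

Sorting by total time rather than count surfaces the single KEYS call that costs more than every HGETALL combined.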

LATENCY monitor: catches what SLOWLOG can’t

Different beast. SLOWLOG records commands; LATENCY records events such as fast-command, fork, aof-stat, and expire-cycle. Enable it with CONFIG SET latency-monitor-threshold 100 (the threshold is in milliseconds); query it with LATENCY HISTORY fork.

A common production incident: AOF rewrite forks, fork takes 1.5s on a 30 GB instance, every command during that 1.5s sees the latency. SLOWLOG shows 1000s of slow commands but no obvious cause; LATENCY shows the single fork event that’s the actual story.
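
The correlation described above can be sketched in a few lines: take the timestamps of slow-log entries and the (timestamp, latency_ms) pairs that LATENCY HISTORY returns, and count how many slow commands sit next to a latency event. The function name and the 2-second window are illustrative assumptions, not a recommended setting.

```python
def slow_commands_near_events(slow_timestamps, latency_events, window_s=2):
    """Attribute slow-log entries to nearby latency events.

    slow_timestamps: iterable of Unix timestamps (seconds), one per
        slow-log entry.
    latency_events: iterable of (unix_ts, latency_ms) pairs, the shape
        returned by LATENCY HISTORY <event>.
    window_s: how close (in seconds) an entry must be to an event to be
        attributed to it. The default is an arbitrary illustration.

    Returns (explained, unexplained): explained maps each event
    timestamp to the number of slow commands near it; unexplained counts
    the entries that matched no event.
    """
    events = sorted(latency_events)
    explained = {ts: 0 for ts, _ in events}
    unexplained = 0
    for t in slow_timestamps:
        match = next((ts for ts, _ in events if abs(t - ts) <= window_s), None)
        if match is None:
            unexplained += 1
        else:
            explained[match] += 1
    return explained, unexplained


# Hypothetical data: one 1.5 s fork event and four slow-log entries.
FORK_EVENTS = [(1700000000, 1500)]
SLOW_TS = [1700000000, 1700000001, 1700000001, 1700000300]
```

If most entries land in the explained bucket for a single fork event, the slow commands are a symptom and the fork is the thing to fix.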

Memory fragmentation: the silent killer

Redis allocates with jemalloc. Long-lived workloads with mixed key sizes fragment over weeks until used_memory_rss is 2× used_memory and you OOM your host. Two early-warning signals:

  • mem_fragmentation_ratio drifting above 1.4 and staying there for 24h.
  • active_defrag_running = 1 most of the time means defrag can’t keep up; raise active-defrag-cycle-max.
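
A "sustained for 24h" check needs history, not a single reading. The following is a rough sketch, assuming you keep (timestamp, value) samples somewhere; the helper names are hypothetical, and the check deliberately refuses to fire when the stored history is shorter than the window.

```python
def fragmentation_ratio(info):
    """used_memory_rss / used_memory from a parsed INFO dict of strings."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])


def sustained_above(samples, threshold, duration_s):
    """True if every sample in the trailing window exceeds the threshold.

    samples: list of (unix_ts, value), sorted by time ascending.
    Returns False when the history does not reach back far enough to
    cover the whole window, so a freshly started collector cannot alert.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start = end - duration_s
    if samples[0][0] > start:
        return False
    window = [value for ts, value in samples if ts >= start]
    return all(value > threshold for value in window)


# Hypothetical hourly samples covering 25 hours.
HOURLY = [(hour * 3600, 1.45) for hour in range(26)]
DAY_S = 24 * 3600
```

With the sample data, sustained_above(HOURLY, 1.4, DAY_S) holds, while a single dip below 1.4 inside the window resets it.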

Replication and cluster monitoring

For Sentinel: alert on +sdown and +odown events. For Cluster:

  • Alert on cluster_state != ok immediately.
  • Track cluster_slots_assigned; it should always equal 16384.
  • Watch cluster_known_nodes; an unexpected drop means a node was forgotten.
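
The three cluster checks can be sketched against the text that CLUSTER INFO returns. The function name and the expected_nodes parameter are assumptions: the node count you expect is something you supply from your own inventory, since Redis cannot tell you what the topology should be.

```python
EXPECTED_SLOTS = 16384


def check_cluster(cluster_info_text, expected_nodes):
    """Return a list of human-readable problems found in CLUSTER INFO output.

    cluster_info_text: raw "field:value" lines as printed by CLUSTER INFO.
    expected_nodes: node count from your own inventory (an assumption of
        this sketch).
    """
    fields = dict(
        line.strip().split(":", 1)
        for line in cluster_info_text.splitlines()
        if ":" in line
    )
    problems = []
    state = fields.get("cluster_state")
    if state != "ok":
        problems.append(f"cluster_state is {state!r}")
    slots = int(fields.get("cluster_slots_assigned", 0))
    if slots != EXPECTED_SLOTS:
        problems.append(f"only {slots} of {EXPECTED_SLOTS} slots assigned")
    nodes = int(fields.get("cluster_known_nodes", 0))
    if nodes < expected_nodes:
        problems.append(f"{nodes} known nodes, expected {expected_nodes}")
    return problems


# Hypothetical outputs for illustration.
HEALTHY = "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_known_nodes:6\r\n"
DEGRADED = "cluster_state:fail\r\ncluster_slots_assigned:16000\r\ncluster_known_nodes:5\r\n"
```

An empty list means all three checks passed; anything else is worth paging on, starting with the cluster_state entry.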

Eight alerts every Redis deployment should ship with

  • used_memory / maxmemory > 80% for 5m
  • mem_fragmentation_ratio > 1.5 for 1h
  • evicted_keys delta > 0 for instances that should not be evicting
  • master_link_status = down for 10s on a replica
  • latest_fork_usec > 500000 (bg-saves are pausing the world)
  • blocked_clients > 50 sustained
  • connected_clients trending up > 10% / day
  • cluster_state != ok (immediate)
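
As a rough illustration, the eight thresholds above can be written as one function over INFO snapshots. This sketch only tests whether each threshold is crossed right now; the "for 5m" and "for 1h" durations are left to whatever alerting layer you use. The rule names, the argument shapes, and the should_evict flag are all assumptions made for the example.

```python
def breached_rules(curr, prev, day_ago, cluster_state=None, should_evict=False):
    """Return the names of alert rules whose threshold is crossed right now.

    curr, prev, day_ago: dicts of INFO fields already converted to
        numbers (master_link_status stays a string). prev is the previous
        scrape; day_ago is a scrape from roughly 24 hours earlier.
    cluster_state: value from CLUSTER INFO, or None outside cluster mode.
    should_evict: True for caches where eviction is expected.
    """
    breached = []
    maxmemory = curr.get("maxmemory", 0)
    if maxmemory and curr["used_memory"] / maxmemory > 0.80:
        breached.append("memory_above_80pct")
    if curr["mem_fragmentation_ratio"] > 1.5:
        breached.append("fragmentation")
    if not should_evict and curr["evicted_keys"] - prev["evicted_keys"] > 0:
        breached.append("unexpected_evictions")
    if curr.get("master_link_status") == "down":
        breached.append("replica_link_down")
    if curr["latest_fork_usec"] > 500_000:
        breached.append("slow_fork")
    if curr["blocked_clients"] > 50:
        breached.append("blocked_clients")
    baseline = day_ago["connected_clients"]
    if baseline and curr["connected_clients"] > 1.10 * baseline:
        breached.append("client_growth")
    if cluster_state is not None and cluster_state != "ok":
        breached.append("cluster_not_ok")
    return breached


# Hypothetical snapshots for illustration.
CURR = {
    "used_memory": 13_000_000_000, "maxmemory": 16_000_000_000,
    "mem_fragmentation_ratio": 1.2, "evicted_keys": 10,
    "latest_fork_usec": 1_500_000, "blocked_clients": 0,
    "connected_clients": 1820,
}
PREV = {"evicted_keys": 10}
DAY_AGO = {"connected_clients": 1800}
```

Keeping the thresholds in one place makes it easy to diff them against the list above when either one changes.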

FAQ

Should I use Prometheus's redis_exporter?
Fine for ~30 instances. At scale the per-instance HTTP scrape adds up; prefer an agent that reads INFO once and emits all metrics in one push.
What about Redis Sentinel monitoring?
Subscribe to the +sdown / +odown / +switch-master pub/sub channels. Don't poll — events are the source of truth.
Active-defrag: safe to enable?
Yes on Redis 7+. The cycle parameters keep it bounded; default 1-25% CPU is conservative.
