Obsfly

Elasticsearch Monitoring

Elasticsearch monitoring that watches the JVM heap before the cluster falls over.

Obsfly scrapes _cluster/health, _cluster/stats, _nodes/stats, and the slow log every 15s — surfacing heap pressure, GC pauses, indexing throughput, and search latency tails across every node.

Why monitor Elasticsearch

Elasticsearch in production is mostly a JVM-tuning game with shard-allocation politics on top. The metrics that matter — heap pressure, GC time, indexing back-pressure, queue overflow — are buried in node stats. Obsfly surfaces them.

What we scrape

Obsfly reads Elasticsearch through the surfaces operators already know. No driver changes, no extensions installed by us, no agent on the cluster itself.

_cluster/health

Cluster status (green/yellow/red), unassigned shards, pending tasks.

_cluster/stats

Total shards, indices, fielddata size, query/fetch latency.

_nodes/stats

Per-node JVM heap, GC, thread pools, HTTP, transport.

Slow log (settings index.search.slowlog.*)

Slow searches and indexes captured per request.

_cat/shards / _cat/recovery

Shard placement and recovery state.
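A sketch of what one 15-second scrape reduces to. The payload shape below matches the standard _cluster/health response; the helper name is ours for illustration, not Obsfly's API.

```python
import json

def health_summary(payload: dict) -> dict:
    """Reduce a _cluster/health response to the fields worth alerting on."""
    return {
        "status": payload["status"],  # green / yellow / red
        "unassigned_shards": payload["unassigned_shards"],
        "pending_tasks": payload["number_of_pending_tasks"],
    }

# Abridged response shape from GET /_cluster/health.
sample = json.loads("""
{"cluster_name": "prod", "status": "yellow",
 "unassigned_shards": 3, "number_of_pending_tasks": 0}
""")
print(health_summary(sample))
```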

Key metrics tracked

JVM heap used %
> 75% sustained means GC pressure; > 85% means trouble.
GC old / young pause time
Pauses > 1s stall query threads and inflate latency.
Search latency p99 per index
From _nodes/stats search.fetch_time and slow log.
Indexing rate vs queue capacity
Bulk thread pool queue depth and rejections.
Unassigned shards
Cluster yellow → red in 1 metric.
Fielddata size / circuit breaker
Old-style fielddata can OOM nodes.
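The heap bands above can be written as a trivial classifier. Thresholds are the ones stated on this page; the function name is illustrative.

```python
def heap_state(heap_used_percent: float) -> str:
    """Map JVM heap usage to the alert bands described above."""
    if heap_used_percent > 85:
        return "critical"   # circuit-breaker trips and OOM territory
    if heap_used_percent > 75:
        return "pressure"   # sustained GC pressure
    return "ok"

# heap_used_percent comes from _nodes/stats:
# nodes.<node_id>.jvm.mem.heap_used_percent
print(heap_state(72), heap_state(78), heap_state(90))  # ok pressure critical
```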

Common Elasticsearch pains, and how Obsfly surfaces each

Old-gen GC pauses spiking

Sign

Old-gen GC count growing; pause time > 1s; heap usage stays high after collection.

Fix

Either the heap is too small, or fielddata is bloating it. Increase heap (max 30.5 GB for compressed oops), or migrate fielddata-heavy fields to doc_values.
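A minimal sketch of the doc_values migration, assuming a hypothetical logs index with an analyzed message field. Instead of enabling heap-resident fielddata on the text field, aggregations target a keyword sub-field, which stores on-disk doc_values by default.

```python
import json

# Hypothetical index "logs": aggregate on the keyword sub-field
# "message.raw" (doc_values on disk) rather than turning on
# fielddata for the analyzed "message" field (resident on heap).
mapping = {
    "mappings": {
        "properties": {
            "message": {
                "type": "text",
                "fields": {
                    "raw": {"type": "keyword"}  # doc_values: true by default
                }
            }
        }
    }
}
# PUT /logs with this body; point aggregations at "message.raw".
print(json.dumps(mapping, indent=2))
```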

Indexing rate drops under load

Sign

Bulk thread pool ("write" in 7.x+) rejections climb; queue depth is saturated.

Fix

Increase queue size (cautiously). Better: shard your indices more, or switch to time-series data streams (TSDS).
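A sketch of pulling those two signals out of a _nodes/stats payload. The sample shape is abridged, the helper name is ours, and the pool is named "write" in Elasticsearch 7+ ("bulk" in older versions), so the lookup tries both.

```python
def bulk_pressure(nodes_stats: dict) -> list:
    """Per node: indexing thread-pool queue depth and cumulative rejections."""
    out = []
    for node_id, node in nodes_stats["nodes"].items():
        # "write" in 7.x+, "bulk" in older clusters
        pool = node["thread_pool"].get("write") or node["thread_pool"].get("bulk", {})
        out.append((node.get("name", node_id),
                    pool.get("queue", 0),
                    pool.get("rejected", 0)))
    return out

# Abridged shape from GET /_nodes/stats/thread_pool.
sample = {"nodes": {"abc123": {"name": "data-1",
    "thread_pool": {"write": {"queue": 180, "rejected": 42}}}}}
print(bulk_pressure(sample))  # [('data-1', 180, 42)]
```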

Unassigned shards stay yellow/red

Sign

_cluster/health shows unassigned > 0; _cat/shards shows reason.

Fix

Allocation explain API: GET /_cluster/allocation/explain. Common causes: disk watermark exceeded, allocation filter mismatch.
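A sketch of summarizing the explain response into one line per stuck shard. The sample payload is abridged to the fields shown, and the helper name is illustrative.

```python
def explain_unassigned(explain: dict) -> str:
    """Summarize GET /_cluster/allocation/explain for an unassigned shard."""
    shard = f"[{explain['index']}][{explain['shard']}]"
    reason = explain.get("unassigned_info", {}).get("reason", "unknown")
    blockers = [
        d["explanation"]
        for node in explain.get("node_allocation_decisions", [])
        for d in node.get("deciders", [])
        if d.get("decision") == "NO"
    ]
    return f"{shard} unassigned ({reason}): " + ("; ".join(blockers) or "no blocking decider")

# Abridged explain response: node left, disk watermark blocks reallocation.
sample = {
    "index": "logs-2024.01", "shard": 0,
    "unassigned_info": {"reason": "NODE_LEFT"},
    "node_allocation_decisions": [{"deciders": [
        {"decision": "NO",
         "explanation": "the node is above the high disk watermark"}]}],
}
print(explain_unassigned(sample))
```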

vs Datadog DBM for Elasticsearch

Datadog Elasticsearch ships node-stats scraping. Obsfly adds shard-level allocation history, slow-log structured parsing per index, and JVM heap forecast bands — predicting OOM hours ahead.
Full Datadog DBM comparison →

FAQ

OpenSearch supported?

Yes — OpenSearch exposes the same APIs and runs on the same JVM. Obsfly supports both Elasticsearch (OSS and Elastic's commercial distribution) and OpenSearch.

Versions?

Elasticsearch 7.x, 8.x, 9.x. OpenSearch 1.x, 2.x, 3.x. Older 6.x works with reduced detail.

· · ·

See Obsfly on your Elasticsearch.

20-min demo. We connect to a sample Elasticsearch on the call and reproduce your slowest query in the tool.

Elasticsearch monitoring — slow log, cluster health, JVM, anomalies · Obsfly