Obsfly

ClickHouse Monitoring

ClickHouse monitoring that knows about parts.

Obsfly scrapes system.query_log, system.parts, system.merges, system.mutations and system.replication_queue — giving you the metrics ClickHouse-aware operators actually want.

Why monitor ClickHouse

ClickHouse has unique pathologies: too-many-parts errors, slow merges, distributed queries stuck on a single shard. Generic DBM tools miss them. Obsfly ships ClickHouse-native metrics out of the box.

What we scrape

Obsfly reads ClickHouse through the surfaces operators already know. No driver changes, no extensions installed by us, no agent on the database itself.

system.query_log

Per-query execution: type, query_kind, duration, read/written rows, memory usage.

system.parts / system.parts_columns

Active part counts per table, total bytes, granule count.

system.merges / system.mutations

In-flight merges, mutation backlog, per-table merge throughput.

system.replication_queue

Replicated table operations queued, errors, blocked merges.

system.metrics / system.events / system.asynchronous_metrics

Counters for everything from compressed bytes to file descriptors.

system.processes

Live queries, memory consumption, elapsed time.

Key metrics tracked

Query latency p50/p95/p99 by query_kind
Splits Select/Insert/Alter to surface workload-specific regressions.
Parts per table
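ClickHouse rejects inserts once a single partition holds too many active parts (the parts_to_throw_insert setting: 300 by default in older releases, 3000 in recent ones). Alert well before that limit.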
Merge backlog
Pending merges queued; if the backlog keeps growing, inserts are creating parts faster than merges can absorb them.
Replication queue depth
Per replica, with forecast.
Memory per query
system.processes peak memory; alert on OOM-risk queries.
ZooKeeper / Keeper latency
ReplicatedMergeTree depends on it; lag here cascades.
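
As an illustration of the first metric above, here is a minimal sketch of latency percentiles grouped by query_kind. It is not Obsfly's implementation: the SQL text, the 5-minute window, and the function names are assumptions, and only the column names come from system.query_log. The percentile uses the nearest-rank method on rows already fetched by a client.

```python
from collections import defaultdict

# Hypothetical query an agent might run; the 5-minute window is an assumption.
QUERY_LOG_SQL = """
SELECT query_kind, query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 5 MINUTE
"""


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list (p is an int in 1..100)."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def latency_by_kind(rows):
    """rows: iterable of (query_kind, query_duration_ms) tuples."""
    by_kind = defaultdict(list)
    for kind, duration_ms in rows:
        by_kind[kind].append(duration_ms)
    return {
        kind: {f"p{p}": percentile(sorted(durations), p) for p in (50, 95, 99)}
        for kind, durations in by_kind.items()
    }
```

Keeping Select, Insert, and Alter in separate buckets means a slow batch insert cannot hide inside a healthy overall p95.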

Common ClickHouse pains, and how Obsfly surfaces each

'Too many parts' errors blocking inserts

Sign

Insert fails with TOO_MANY_PARTS; system.parts shows the active part count in a single partition of the table at or above parts_to_throw_insert (300 by default in older releases).

Fix

Batch inserts into fewer, larger writes. Tune background_pool_size. Consider min_bytes_for_wide_part.
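
A minimal sketch of the check behind this alert, assuming the rows come from a per-partition count over system.parts. The SQL text, the function name, and the 80% warning ratio are illustrative assumptions, and the limit should be set to whatever parts_to_throw_insert is on your server.

```python
# Hypothetical query; column names are from system.parts.
ACTIVE_PARTS_SQL = """
SELECT database, table, partition_id, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
"""


def partitions_at_risk(rows, parts_to_throw_insert=300, warn_ratio=0.8):
    """Return partitions at or above warn_ratio of the insert-rejection limit.

    rows: (database, table, partition_id, active_parts) tuples.
    The default limit of 300 is an assumption; pass your server's value.
    """
    warn_at = parts_to_throw_insert * warn_ratio
    flagged = [row for row in rows if row[3] >= warn_at]
    # Worst offenders first.
    return sorted(flagged, key=lambda row: row[3], reverse=True)
```

Counting per partition rather than per table matters here, because the limit applies to each partition separately.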

Slow merges, growing parts count

Sign

system.merges has long-running merges; parts count climbs week-over-week.

Fix

The usual causes are a storage IO ceiling or merge thread starvation. Inspect background_pool_size and the max_bytes_to_merge_at_* settings.
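
To make "climbs week-over-week" concrete, here is a hedged sketch that fits a straight line to daily part-count samples and estimates how long until a limit is reached. The linear model and both function names are assumptions chosen for illustration, not a description of Obsfly's forecasting.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for (t, value) samples.

    Needs at least two samples with distinct t values.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def days_until(samples, limit):
    """Days after the last sample until the fitted line reaches limit.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    return max(0.0, (limit - intercept) / slope - last_t)
```

For example, samples of (day, active parts) growing by about 10 parts a day from 100 would reach a limit of 300 roughly 17 days after day 3.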

Replicated table stuck

Sign

system.replication_queue shows operations with last_exception set.

Fix

Inspect the exception. Common causes: a schema mismatch between replicas, or an exhausted ZooKeeper quota.
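
A small sketch of how failing queue entries could be grouped so the noisiest table surfaces first. The SQL text and function name are assumptions; the columns (database, table, type, num_tries, last_exception) are from system.replication_queue.

```python
from collections import Counter

# Hypothetical query; only entries that have recorded an exception.
REPLICATION_QUEUE_SQL = """
SELECT database, table, type, num_tries, last_exception
FROM system.replication_queue
WHERE last_exception != ''
"""


def summarize_failures(rows):
    """Count failing queue entries per (table, entry type), most frequent first.

    rows: (database, table, type, num_tries, last_exception) tuples.
    """
    counts = Counter()
    for database, table, entry_type, _num_tries, last_exception in rows:
        if last_exception:  # skip healthy entries if the caller did not filter
            counts[(f"{database}.{table}", entry_type)] += 1
    return counts.most_common()
```

The grouped counts tell you where to look; the last_exception text on the individual entries tells you why.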

vs Datadog DBM for ClickHouse

Datadog supports ClickHouse via OpenMetrics, but lacks parts-aware alerts and merge-backlog forecasting. Obsfly ships ClickHouse-native dashboards and the alerts that match the database's actual pathologies.

FAQ

Self-hosted, ClickHouse Cloud, Altinity: all supported?

Yes. The agent connects via the native protocol and reads the system database. Cloud and Altinity expose the same system tables.

Obsfly runs on ClickHouse. Do you eat your own dog food?

Yes. Our internal observability for the Obsfly data plane is Obsfly scraping its own ClickHouse. Running it ourselves 24/7 is what hardens the integration.


See Obsfly on your ClickHouse.

20-min demo. We connect to a sample ClickHouse on the call and reproduce your slowest query in the tool.
