ClickHouse in production: monitoring without becoming a hot spot yourself
system.query_log is huge. system.parts is bigger still. Here is what you actually need to scrape, and how to monitor a ClickHouse cluster without burning half its CPU on system-table queries.
ClickHouse’s self-monitoring tables are great until your monitoring tool becomes a top-5 query in the cluster you’re trying to monitor. The trick is selective scraping: full granularity on system.merges and system.replication_queue, sampled granularity on system.query_log, and almost nothing on system.parts.
What to scrape — and what to skip
- system.metrics + system.events: cheap, gauges + counters. Scrape every 15s (a minimal scrape query follows this list).
- system.query_log: expensive at high cardinality. Sample 1 in 100, or aggregate via a materialized view.
- system.merges: read all rows, low frequency. 1Hz is fine.
- system.parts: huge on big tables. Don’t scrape directly. Pre-aggregate via a rate-limited query that groups by table (see the rollup sketch after this list).
- system.replication_queue: critical. Alert on rows present for > 5 minutes.
- system.mutations: long-running ALTERs. Alert on is_done = 0 for > 30m.
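The cheap tier can be read whole on every 15s scrape; both tables are small in-memory maps (a few hundred rows each), so a full read is a point lookup, not a scan. A minimal sketch:

```sql
-- Gauges: current state (running queries, open connections, ...).
SELECT metric, value FROM system.metrics;

-- Counters: cumulative event totals; compute rates on the collector side.
SELECT event, value FROM system.events;
```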
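For system.parts, the rate-limited rollup mentioned above: one row per table instead of one row per part. A sketch, assuming a ~60s cadence is slow enough for your cluster:

```sql
-- Per-table part rollup; run at low frequency, never per scrape.
SELECT
    database,
    table,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 20;
```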
Merges and mutations: the most-actionable signals
```sql
-- How many merges in flight, and how big?
SELECT
    table,
    count() AS active,
    sum(num_parts) AS parts,
    formatReadableSize(sum(total_size_bytes_compressed)) AS size,
    max(elapsed) AS oldest_sec
FROM system.merges
GROUP BY table
ORDER BY active DESC;
```
Healthy: most merges complete in < 1 minute. Unhealthy: a single merge running for 30+ minutes usually means the part is too large or the host is IO-saturated.
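A direct alert query for the unhealthy case, using the 30-minute line from above (the threshold is a starting point, not a rule):

```sql
-- Merges running longer than 30 minutes: oversized part or saturated IO.
SELECT
    database,
    table,
    round(elapsed) AS running_sec,
    round(progress * 100, 1) AS pct_done,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
WHERE elapsed > 1800
ORDER BY elapsed DESC;
```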
Replication queue health
```sql
SELECT
    database, table,
    count() AS pending,
    countIf(num_postponed > 0) AS postponed,
    max(num_tries) AS retries,
    min(create_time) AS oldest
FROM system.replication_queue
GROUP BY database, table
HAVING pending > 0;
```
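To turn the “> 5 minutes” rule into an alert, filter on entry age; a sketch, with the window as the tunable assumption:

```sql
-- Queue entries older than 5 minutes: replication is stuck, not just busy.
SELECT database, table, count() AS stuck, min(create_time) AS oldest
FROM system.replication_queue
WHERE create_time < now() - INTERVAL 5 MINUTE
GROUP BY database, table;
```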
Memory and the uncompressed cache
Three memory pools to track (the query after this list pulls all three):
- OS RSS: RSS minus mark cache minus uncompressed cache, the real working set.
- MarkCacheBytes: column index pages. Should be near mark_cache_size.
- UncompressedCacheBytes: the most often skipped, because the cache is off by default; turn it on for narrow OLAP workloads.
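All three are visible in system.asynchronous_metrics (metric names as in recent ClickHouse releases), so one cheap query covers the whole list:

```sql
SELECT metric, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'MarkCacheBytes', 'UncompressedCacheBytes');
```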
Four anomaly patterns specific to ClickHouse
- Too many parts: when inserts arrive faster than merges, parts pile up; INSERT throws “Too many parts” at 300 per partition. Alert at 200 (see the query after this list).
- Mutation deadlock: an ALTER with an expensive predicate runs forever. Visible as a row in system.mutations with a high parts_to_do.
- Replication divergence: absolute_delay climbing on one replica only.
- Compression ratio drift: a falling ratio means the data shape is changing; usually JSON columns or new high-cardinality dimensions.
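The too-many-parts check, counting active parts per partition against the 200 warning line:

```sql
-- Partitions approaching the 300-part insert rejection threshold.
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING parts > 200
ORDER BY parts DESC;
```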
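And for the mutation pattern, a sketch that surfaces unfinished mutations oldest-first, which pairs with the 30-minute alert from the scrape list:

```sql
-- Unfinished mutations; a stale row with high parts_to_do is the deadlock signature.
SELECT database, table, mutation_id, command, parts_to_do, latest_fail_reason, create_time
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time ASC;
```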
FAQ
How do I sample query_log without losing the slow ones?
Keep every slow query unconditionally and sample only the fast majority, so the latency tail is never thinned out.
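A minimal sketch, assuming a 1-second slow threshold and a ~1% sample rate (both numbers are assumptions to tune):

```sql
SELECT event_time, query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 MINUTE
  -- Keep everything slow; hash-sample ~1% of the rest.
  AND (query_duration_ms > 1000 OR cityHash64(query_id) % 100 = 0);
```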
Should I monitor system.parts at all?
Not directly. The per-table rollup above, run on a slow cadence, gives you part counts and sizes; per-part rows are only worth reading when you are debugging a specific table.
ZooKeeper / ClickHouse Keeper monitoring?
Yes: the replication queue only drains while coordination is healthy. Both ZooKeeper and ClickHouse Keeper answer the mntr four-letter-word command with latency and outstanding-request counters, and ClickHouse exposes its own side of the session in the ZooKeeper* entries of system.metrics and system.events.