ClickHouse in production: monitoring without becoming a hot spot yourself
system.query_log is huge. system.parts is bigger still. Here is what you actually need to scrape, and how to monitor a ClickHouse cluster without burning half its CPU on system-table queries.
ClickHouse’s self-monitoring tables are great until your monitoring tool becomes a top-5 query in the cluster you’re trying to monitor. The trick is selective scraping: full granularity on system.merges and system.replication_queue, sampled granularity on system.query_log, and almost nothing on system.parts.
What to scrape — and what to skip
- system.metrics + system.events: cheap, gauges + counters. Scrape every 15s (a minimal scrape query follows this list).
- system.query_log: expensive at high cardinality. Sample 1 in 100, or aggregate via a materialized view.
- system.merges: read all rows, low frequency. 1Hz is fine.
- system.parts: huge on big tables. Don’t scrape directly. Pre-aggregate via a rate-limited query that groups by table (see the rollup sketch after this list).
- system.replication_queue: critical. Alert on rows present for > 5 minutes.
- system.mutations: long-running ALTERs. Alert on is_done = 0 for > 30m.
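The cheap tier can be read whole on every 15s scrape; both tables are small in-memory maps (a few hundred rows each), so a full read is a point lookup, not a scan. A minimal sketch:

```sql
-- Gauges: current state (running queries, open connections, ...).
SELECT metric, value FROM system.metrics;

-- Counters: cumulative event totals; compute rates on the collector side.
SELECT event, value FROM system.events;
```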
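For system.parts, the rate-limited rollup mentioned above: one row per table instead of one row per part. A sketch, assuming a ~60s cadence is slow enough for your cluster:

```sql
-- Per-table part rollup; run at low frequency, never per scrape.
SELECT
    database,
    table,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 20;
```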
Merges and mutations: the most-actionable signals
```sql
-- How many merges in flight, and how big?
SELECT
    table,
    count() AS active,
    sum(num_parts) AS parts,
    formatReadableSize(sum(total_size_bytes_compressed)) AS size,
    max(elapsed) AS oldest_sec
FROM system.merges
GROUP BY table
ORDER BY active DESC;
```
Healthy: most merges complete in < 1 minute. Unhealthy: a single merge running for 30+ minutes usually means the part is too large or the host is IO-saturated.
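A direct alert query for the unhealthy case, using the 30-minute line from above (the threshold is a starting point, not a rule):

```sql
-- Merges running longer than 30 minutes: oversized part or saturated IO.
SELECT
    database,
    table,
    round(elapsed) AS running_sec,
    round(progress * 100, 1) AS pct_done,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
WHERE elapsed > 1800
ORDER BY elapsed DESC;
```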
Replication queue health
```sql
SELECT
    database, table,
    count() AS pending,
    countIf(num_postponed > 0) AS postponed,
    max(num_tries) AS retries,
    min(create_time) AS oldest
FROM system.replication_queue
GROUP BY database, table
HAVING pending > 0;
```
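To turn the “> 5 minutes” rule into an alert, filter on entry age; a sketch, with the window as the tunable assumption:

```sql
-- Queue entries older than 5 minutes: replication is stuck, not just busy.
SELECT database, table, count() AS stuck, min(create_time) AS oldest
FROM system.replication_queue
WHERE create_time < now() - INTERVAL 5 MINUTE
GROUP BY database, table;
```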
Memory and the uncompressed cache
Three memory pools to track (the query after this list pulls all three):
- OS RSS: RSS minus mark cache minus uncompressed cache, the real working set.
- MarkCacheBytes: column index pages. Should be near mark_cache_size.
- UncompressedCacheBytes: the most often skipped, because the cache is off by default; turn it on for narrow OLAP workloads.
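All three are visible in system.asynchronous_metrics (metric names as in recent ClickHouse releases), so one cheap query covers the whole list:

```sql
SELECT metric, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'MarkCacheBytes', 'UncompressedCacheBytes');
```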
Four anomaly patterns specific to ClickHouse
- Too many parts: when inserts arrive faster than merges, parts pile up; INSERT throws “Too many parts” at 300 per partition. Alert at 200 (see the query after this list).
- Mutation deadlock: an ALTER with an expensive predicate runs forever. Visible as a row in system.mutations with a high parts_to_do.
- Replication divergence: absolute_delay climbing on one replica only.
- Compression ratio drift: a falling ratio means the data shape is changing; usually JSON columns or new high-cardinality dimensions.
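The too-many-parts check, counting active parts per partition against the 200 warning line:

```sql
-- Partitions approaching the 300-part insert rejection threshold.
SELECT database, table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING parts > 200
ORDER BY parts DESC;
```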
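And for the mutation pattern, a sketch that surfaces unfinished mutations oldest-first, which pairs with the 30-minute alert from the scrape list:

```sql
-- Unfinished mutations; a stale row with high parts_to_do is the deadlock signature.
SELECT database, table, mutation_id, command, parts_to_do, latest_fail_reason, create_time
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time ASC;
```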
FAQ
How do I sample query_log without losing the slow ones?
Keep every slow query unconditionally and sample only the fast majority, so the latency tail is never thinned out.
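A minimal sketch, assuming a 1-second slow threshold and a ~1% sample rate (both numbers are assumptions to tune):

```sql
SELECT event_time, query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 MINUTE
  -- Keep everything slow; hash-sample ~1% of the rest.
  AND (query_duration_ms > 1000 OR cityHash64(query_id) % 100 = 0);
```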
Should I monitor system.parts at all?
Not directly. The per-table rollup above, run on a slow cadence, gives you part counts and sizes; per-part rows are only worth reading when you are debugging a specific table.
ZooKeeper / ClickHouse Keeper monitoring?
Yes: the replication queue only drains while coordination is healthy. Both ZooKeeper and ClickHouse Keeper answer the mntr four-letter-word command with latency and outstanding-request counters, and ClickHouse exposes its own side of the session in the ZooKeeper* entries of system.metrics and system.events.