[Obsfly live dashboard widget: SLO / 30-day budget, 4 indicators — p99 query latency < 200ms (64% burnt), availability ≥ 99.9% (18% burnt), connection error rate < 0.1% (92% burnt), replica lag < 5s (42% burnt)]

SRE

DB SLOs that aren't useless: a practical definition

Most DB SLOs are "CPU below 80%". That's a budget alert, not a service objective. Here's how to define an SLO that an executive will sign off on and an engineer can execute.

Published · 9 min read

“CPU under 80%” isn’t an SLO. It’s a budget alert. A real database SLO has three parts — Service Level Indicator, target, and error budget — and exists to drive decisions, not to sit on a Confluence page.

On this page
  1. SLI, SLO, error budget
  2. Concrete examples by workload
  3. Using the error budget
  4. Anti-patterns
  5. FAQ

SLI, SLO, error budget

  • SLI — a metric users notice, expressed as a fraction. “Of all reads against the orders DB in the last hour, what fraction returned in < 100ms?”
  • SLO — a target for the SLI over a window. “99.5% of reads complete in < 100ms over a 30-day window.”
  • Error budget — what's left to burn. If your SLO is 99.5%, your budget is 0.5% of a 720-hour month = 3.6 hours of failure budget per month (see the sketch below).
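
The arithmetic generalizes. A minimal sketch in Python (the helper name is illustrative, not any particular tool's API):

def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of allowed failure in the window for a given SLO target."""
    return (1.0 - slo_target) * window_days * 24

print(error_budget_hours(0.995))       # 99.5% over 30 days -> 3.6 hours
print(error_budget_hours(0.999) * 60)  # 99.9% over 30 days -> ~43 minutes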

Concrete examples by workload

# OLTP read-heavy
99.5% of SELECTs against the orders DB return in < 100ms,
   over a rolling 30-day window.

# OLTP write
99.0% of INSERTs into orders complete in < 250ms,
   over a rolling 30-day window.

# Analytics
95% of dashboard queries return in < 5s,
   measured per-query, rolled up over 7 days.

# Queue-style
99.9% of rows enqueued are picked up within 5s,
   measured by enqueue → first-fetch latency.

# Replication
99.5% of read-replica queries see data no more than 100ms behind the master,
   measured as per-statement replication lag.
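
Every example above reads the same way in code: an SLI is a count of good events over a count of total events. A sketch in Python, assuming you can pull per-query latencies from your metrics pipeline (the latencies_ms input is illustrative):

# Evaluate the OLTP read SLI: fraction of queries under the threshold.
def sli(latencies_ms: list[float], threshold_ms: float = 100.0) -> float:
    if not latencies_ms:
        return 1.0  # no traffic means nothing violated the objective
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)

# 9,950 fast reads and 50 slow ones -> SLI = 0.995, exactly on target
print(sli([80.0] * 9950 + [300.0] * 50))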

Using the error budget

The error budget is what makes an SLO operational. Three concrete uses:

  • Decision lever in incident reviews — if you’ve burned 80% of your budget mid-month, the next risky migration waits.
  • Change-management gate — releases pause when the budget is exhausted and ship freely while it's healthy (sketched below).
  • Capacity prioritization — when adjacent SLOs are all healthy and this one is bleeding budget, scale or tune this one first.
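
A sketch of that gate in Python (thresholds and names are illustrative; wire them to your own policy):

# Map the fraction of error budget burned (0..1) to a release policy.
def release_gate(budget_burned: float) -> str:
    if budget_burned >= 1.0:
        return "freeze"   # budget exhausted: reliability work only
    if budget_burned >= 0.8:
        return "review"   # risky changes, like migrations, wait
    return "ship"         # healthy budget: release freely

print(release_gate(0.42))  # ship
print(release_gate(0.92))  # review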

Anti-patterns

  • The 100% SLO — math says you can’t hit it. Every minute of natural variance is an SLO violation. Pick 99.5 or 99.9, not 100.
  • The infrastructure SLI — “CPU < 80%” isn’t an SLI. Users don’t feel CPU; they feel latency or errors.
  • The single-window SLO — only measuring monthly hides bad weeks. Roll up at multiple windows (1h, 24h, 7d, 30d) and alert on multi-window burn rates (sketched after this list).
  • The SLO nobody owns — if it doesn’t name a team and a dashboard, it doesn’t exist.
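
On the multi-window point: a burn rate of 1 means you're spending budget exactly as fast as a 30-day window allows; 14.4 means the whole budget would be gone in about two days. A sketch of the fast-burn page from the Google SRE workbook, with per-window error rates as inputs you'd fetch from your metrics backend:

SLO = 0.995
BUDGET = 1.0 - SLO  # 0.5% of events may fail

def burn_rate(error_rate: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_rate / BUDGET

def fast_burn_page(error_rate_1h: float, error_rate_5m: float) -> bool:
    # Workbook default: page when both the 1h and 5m windows burn > 14.4x.
    return burn_rate(error_rate_1h) > 14.4 and burn_rate(error_rate_5m) > 14.4

print(fast_burn_page(0.08, 0.08))  # 16x burn in both windows -> True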

FAQ

How tight should the SLO target be?
Tight enough that the team would actually pause work to defend it; loose enough that natural variance doesn't trigger it. Start at 99.5% and adjust on review.
Multi-window burn-rate alerting?
Yes — Google's SRE workbook chapter on burn rate is the canonical reference. Alert when the budget will be exhausted in <2h based on current burn.
Per-tenant SLOs?
Useful for B2B platforms with named-account SLAs. Keep them as derivatives of fleet SLOs, not parallel definitions.


· · ·

Watch your databases the way you watch your services.

Book a 30-minute demo. We'll spec your fleet together and quote your first 30-day deal.
