[Obsfly live dashboard widget: SLO / 30-day budget, 4 indicators — p99 query latency < 200ms (64% burnt), availability ≥ 99.9% (18% burnt), connection error rate < 0.1% (92% burnt), replica lag < 5s (42% burnt)]

SRE

DB SLOs that aren't useless: a practical definition

Most DB SLOs are "CPU below 80%". That's a budget alert, not a service objective. Here's how to define an SLO that an executive will sign off on and an engineer can execute.

Published · 9 min read

“CPU under 80%” isn’t an SLO. It’s a budget alert. A real database SLO has three parts — Service Level Indicator, target, and error budget — and exists to drive decisions, not to sit on a Confluence page.

On this page
  1. SLI, SLO, error budget
  2. Concrete examples by workload
  3. Using the error budget
  4. Anti-patterns
  5. FAQ

SLI, SLO, error budget

  • SLI — a metric users notice, expressed as a fraction. “Of all reads against the orders DB in the last hour, what fraction returned in < 100ms?”
  • SLO — a target for the SLI over a window. “99.5% of reads complete in < 100ms over a 30-day window.”
  • Error budget — what's left to burn. If your SLO is 99.5%, your budget is 0.5% of a 720-hour month = 3.6 hours of failure budget per month (see the sketch below).
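
The arithmetic generalizes. A minimal sketch in Python (the helper name is illustrative, not any particular tool's API):

def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of allowed failure in the window for a given SLO target."""
    return (1.0 - slo_target) * window_days * 24

print(error_budget_hours(0.995))       # 99.5% over 30 days -> 3.6 hours
print(error_budget_hours(0.999) * 60)  # 99.9% over 30 days -> ~43 minutes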

Concrete examples by workload

# OLTP read-heavy
99.5% of SELECTs against the orders DB return in < 100ms,
   over a rolling 30-day window.

# OLTP write
99.0% of INSERTs into orders complete in < 250ms,
   over a rolling 30-day window.

# Analytics
95% of dashboard queries return in < 5s,
   measured per-query, rolled up over 7 days.

# Queue-style
99.9% of rows enqueued are picked up within 5s,
   measured by enqueue → first-fetch latency.

# Replication
99.5% of read-replica queries see data no more than 100ms behind the master,
   measured as per-statement replication lag.
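
Every example above reads the same way in code: an SLI is a count of good events over a count of total events. A sketch in Python, assuming you can pull per-query latencies from your metrics pipeline (the latencies_ms input is illustrative):

# Evaluate the OLTP read SLI: fraction of queries under the threshold.
def sli(latencies_ms: list[float], threshold_ms: float = 100.0) -> float:
    if not latencies_ms:
        return 1.0  # no traffic means nothing violated the objective
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)

# 9,950 fast reads and 50 slow ones -> SLI = 0.995, exactly on target
print(sli([80.0] * 9950 + [300.0] * 50))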

Using the error budget

The error budget is what makes an SLO operational. Three concrete uses:

  • Decision lever in incident reviews — if you’ve burned 80% of your budget mid-month, the next risky migration waits.
  • Change-management gate — releases pause when the budget is exhausted and ship freely while it's healthy (sketched below).
  • Capacity prioritization — when adjacent SLOs are all healthy and this one is bleeding budget, scale or tune this one first.
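
A sketch of that gate in Python (thresholds and names are illustrative; wire them to your own policy):

# Map the fraction of error budget burned (0..1) to a release policy.
def release_gate(budget_burned: float) -> str:
    if budget_burned >= 1.0:
        return "freeze"   # budget exhausted: reliability work only
    if budget_burned >= 0.8:
        return "review"   # risky changes, like migrations, wait
    return "ship"         # healthy budget: release freely

print(release_gate(0.42))  # ship
print(release_gate(0.92))  # review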

Anti-patterns

  • The 100% SLO — math says you can’t hit it. Every minute of natural variance is an SLO violation. Pick 99.5 or 99.9, not 100.
  • The infrastructure SLI — “CPU < 80%” isn’t an SLI. Users don’t feel CPU; they feel latency or errors.
  • The single-window SLO — only measuring monthly hides bad weeks. Roll up at multiple windows (1h, 24h, 7d, 30d) and alert on multi-window burn rates (sketched after this list).
  • The SLO nobody owns — if it doesn’t name a team and a dashboard, it doesn’t exist.
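
On the multi-window point: a burn rate of 1 means you're spending budget exactly as fast as a 30-day window allows; 14.4 means the whole budget would be gone in about two days. A sketch of the fast-burn page from the Google SRE workbook, with per-window error rates as inputs you'd fetch from your metrics backend:

SLO = 0.995
BUDGET = 1.0 - SLO  # 0.5% of events may fail

def burn_rate(error_rate: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_rate / BUDGET

def fast_burn_page(error_rate_1h: float, error_rate_5m: float) -> bool:
    # Workbook default: page when both the 1h and 5m windows burn > 14.4x.
    return burn_rate(error_rate_1h) > 14.4 and burn_rate(error_rate_5m) > 14.4

print(fast_burn_page(0.08, 0.08))  # 16x burn in both windows -> True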

FAQ

How tight should the SLO target be?
Tight enough that the team would actually pause work to defend it; loose enough that natural variance doesn't trigger it. Start at 99.5% and adjust on review.
Multi-window burn-rate alerting?
Yes — Google's SRE workbook chapter on burn rate is the canonical reference. Alert when the budget will be exhausted in <2h based on current burn.
Per-tenant SLOs?
Useful for B2B platforms with named-account SLAs. Keep them as derivatives of fleet SLOs, not parallel definitions.


· · ·

Watch your databases the way you watch your services.

Book a 30-minute demo. We'll spec your fleet together and quote your first 30-day deal.
