Obsfly

AI for database query optimization: what's real in 2026 (and what's not)

Two years of shipping LLM-grounded query analysis to production databases. What AI is genuinely good at, what it's bad at, why grounding beats model size, and how BYO LLM works in regulated deployments.

Published · 9 min read

Every observability vendor has “AI” on the slide deck in 2026. Most of it is marketing. Some of it is genuinely useful. This post is about telling the difference, and about what we’ve learned shipping LLM-grounded query analysis to production databases.

On this page
  1. What AI is good at in DB observability
  2. What it's bad at
  3. Why grounding matters more than the model
  4. BYO LLM in regulated deployments
  5. FAQ

What AI is good at in DB observability

After two years of shipping AI features against real Postgres / MySQL workloads, the useful applications cluster into three categories:

  • Plan narration. Given an EXPLAIN ANALYZE plan and a schema, an LLM can explain why a particular plan is slow, in plain language a senior engineer would write. The signal-to-noise here is high because the plan is structured input; there is very little room to hallucinate.
  • Index recommendations grounded in the workload. Given the actual queries from pg_stat_statements and the current indexes, an LLM can propose new indexes (and warn about existing ones a new index would shadow). With grounding, the suggestions are usually right; without it, they’re generic textbook answers.
  • Query rewrites. Specifically: pulling out scalar subqueries, replacing NOT IN with NOT EXISTS, suggesting SELECT column lists instead of SELECT *. These rewrites are mechanical, the LLM gets them right ~95% of the time, and the rare wrong ones are obvious in a diff.
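
To illustrate how mechanical these rewrites are, here is a minimal sketch of the NOT IN → NOT EXISTS case. It is a naive regex for the single-column, uncorrelated form only; table and column names are made up, and a production rewriter would use a real SQL parser:

```python
import re

def rewrite_not_in(query: str) -> str:
    """Rewrite `col NOT IN (SELECT x FROM t)` into a NOT EXISTS form.

    Illustrative only: a naive regex for the simplest case. Note the two
    forms differ when the subquery can return NULLs, which is usually
    exactly the bug this rewrite is meant to fix.
    """
    pattern = re.compile(
        r"(\w+)\s+NOT\s+IN\s+\(\s*SELECT\s+(\w+)\s+FROM\s+(\w+)\s*\)",
        re.IGNORECASE,
    )
    return pattern.sub(
        r"NOT EXISTS (SELECT 1 FROM \3 WHERE \3.\2 = \1)",
        query,
    )

before = "SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM banned)"
print(rewrite_not_in(before))
```

Because the transformation is this local, a wrong rewrite shows up immediately in a diff, which is why the ~95% hit rate is acceptable in practice.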

What it’s bad at

The list is shorter but more important:

  • Predicting whether a recommended index will actually help. Estimating query performance from a plan plus a proposed index requires the optimizer’s cost model. LLMs don’t have it. They’ll happily say “this will reduce execution time by 40%” when the honest answer is “maybe”.
  • Anomaly detection. An LLM cannot beat a seasonal forecast plus change-point detection on time-series metrics. We tried; it didn’t work; we shipped Prophet for forecasting plus Bayesian online change-point detection (BOCPD) instead.
  • Prioritization in incidents. “Which of these 14 alerts should I look at first?” is a question about your team, your SLOs, and your business. The LLM doesn’t know any of those.
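
The cost-model point is easy to make concrete. Postgres’s documented estimate for even a plain sequential scan is pages read × seq_page_cost plus rows scanned × cpu_tuple_cost, and every term comes from live table statistics and cluster settings that a model reading only the SQL text never sees:

```python
def seq_scan_cost(pages: int, rows: int,
                  seq_page_cost: float = 1.0,
                  cpu_tuple_cost: float = 0.01) -> float:
    """Postgres's textbook sequential-scan cost, per the planner docs.

    Defaults match Postgres's default planner constants. The point: the
    estimate is driven by live stats (pages, rows) and cluster settings,
    none of which can be guessed from the query text alone.
    """
    return pages * seq_page_cost + rows * cpu_tuple_cost

# Same query shape, two plausible tables: the estimates differ by 100x.
print(seq_scan_cost(pages=10_000, rows=500_000))  # large table
print(seq_scan_cost(pages=100, rows=5_000))       # small table
```

Any “this will be 40% faster” claim implicitly depends on numbers like these, which is why we treat such predictions as hypotheses to verify, not answers.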

Why grounding matters more than the model

The biggest determinant of output quality is not which model you use — it’s what you feed it. A small model with the right context (schema, top queries, current plan, recent wait events) outperforms a frontier model with no context. We see this consistently in our own benchmarks.

The minimum useful context for query analysis:

  • The query itself (normalized and original).
  • The current EXPLAIN plan with rows, costs, and actual times.
  • The schema of every table the query touches: columns, indexes, row counts.
  • Recent runtime stats from pg_stat_statements: calls, mean, p99, plan flips.
  • Cluster context: Postgres version, settings that affect the planner (work_mem, random_page_cost).

Skip any of those and the suggestions degrade. Send all of them and the suggestions are usually actionable on the first read.
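
Concretely, that context bundle is just a structured payload assembled before the prompt. A minimal sketch, with illustrative field names (not a fixed Obsfly schema):

```python
import json

def build_analysis_context(query, plan, schema, stats, cluster) -> str:
    """Assemble the grounding described above into one JSON block.

    Field names are illustrative. Each section maps to one item of the
    minimum-context list: query text, EXPLAIN plan, touched-table schema,
    pg_stat_statements runtime stats, and planner-relevant settings.
    """
    context = {
        "query": query,            # normalized + original text
        "explain_plan": plan,      # EXPLAIN (ANALYZE) output
        "schema": schema,          # columns, indexes, row counts per table
        "runtime_stats": stats,    # calls, mean, p99, plan flips
        "cluster": cluster,        # version, work_mem, random_page_cost
    }
    return json.dumps(context, indent=2)

payload = build_analysis_context(
    query={"normalized": "SELECT * FROM users WHERE email = $1"},
    plan={"Node Type": "Seq Scan", "Actual Total Time": 412.0},
    schema={"users": {"rows": 2_000_000, "indexes": ["users_pkey"]}},
    stats={"calls": 18423, "mean_ms": 395.2, "p99_ms": 1210.0},
    cluster={"version": "16.3", "work_mem": "64MB", "random_page_cost": 1.1},
)
```

The model sees this payload plus a short instruction; the heavy lifting is in gathering the fields, not in the prompt wording.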

BYO LLM in regulated deployments

For regulated customers — fintech, healthcare, government — sending raw query text to a public LLM is off the table: it would mean account numbers and PII flowing to a third party. Two patterns work:

  • Customer-hosted LLM endpoint. They run a model in their VPC (Bedrock, Vertex, a self-hosted Llama). The DBM tool calls their endpoint with the query text, gets back a response, and never persists the request. This is the dominant pattern in BYOC deployments.
  • Aggressive on-prem normalization. Strip every literal value before sending. WHERE email = 'alice@example.com' becomes WHERE email = $1. Combined with column-name redaction, this works for the long tail of customers who can’t bring their own model but still need analysis.
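
A minimal sketch of that normalization step, using regexes. A real implementation would use a SQL parser to handle dollar-quoting, escaped quotes, and casts, but the shape of the transformation is this:

```python
import re

def normalize_literals(sql: str) -> str:
    """Replace string and numeric literals with positional placeholders.

    Illustrative regex-based sketch of on-prem normalization; it does
    not handle dollar-quoted strings or other SQL edge cases.
    """
    counter = 0

    def repl(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"${counter}"

    # Single-quoted strings first ('' is SQL's escaped quote), then bare
    # numbers; the lookbehind skips the placeholders we just emitted.
    sql = re.sub(r"'(?:[^']|'')*'", repl, sql)
    sql = re.sub(r"(?<!\$)\b\d+(?:\.\d+)?\b", repl, sql)
    return sql

print(normalize_literals(
    "SELECT * FROM payments WHERE email = 'alice@example.com' AND amount > 100"
))
```

After this pass, nothing sensitive remains in the query shape, which is what makes the pattern workable for customers who cannot host a model.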

FAQ

Which model do you use under the hood?
We default to Claude for plan narration and rewrites. The deciding factor isn't raw benchmark score — it's how well the model handles long-context grounding (a full schema plus a 30-step plan plus pg_stat_statements rows is a lot of tokens). For BYOC and Sovereign deployments, customers bring their own.
How do you keep AI suggestions from being wrong in obvious ways?
Three layers. First, structured grounding (above). Second, automated sanity checks — every suggested index goes through a quick simulation against the workload's top 100 queries. Third, every AI output is shown alongside the evidence that produced it; engineers can verify in seconds.
Will AI replace DBAs?
No. It removes 80% of the toil — the index audits, the plan archaeology, the rewrite suggestions. What's left is the judgment work that's actually expensive: schema design, capacity planning, incident response, talking to product about which queries deserve to exist at all.
What about agentic systems that 'auto-fix' production queries?
We don't ship that, and we don't think anyone responsibly should — yet. The blast radius of a wrong index, a wrong vacuum, or a wrongly forced plan is too high. AI proposes, humans approve, the platform applies.


Monitor your databases the way you monitor your services.

Book a 30-minute demo. We'll spec out your fleet together and price your first 30-day deal.
