AI SREs in production: Resolve.ai, Cleric, Parity

The pager goes off at 3:42 AM. Production latency for /api/v1/checkout has spiked from 240ms to 4.8s. By the time the on-call engineer opens her laptop at 3:51 AM, an AI SRE has already pulled the relevant deployment diff (a feature flag was flipped 22 minutes ago), correlated it with a downstream Redis CPU spike, identified the offending query pattern, and posted a thread in Slack with: “Hypothesis: the new checkout-v2 path issues N+1 reads against the inventory cache. Evidence attached. Suggest reverting flag and reviewing query.”

The engineer’s job is now 90% done. She reads the evidence, confirms the hypothesis, reverts the flag. Page resolved at 4:03 AM. MTTR: 21 minutes instead of the usual 90.

This is not the future. This is what Resolve.ai, Cleric, and Parity are shipping today, at companies with well-instrumented services and mature observability stacks. The category went from “interesting demo in 2024” to “real line item on the SRE budget in 2026” faster than most people noticed. The reason is that the architectural problem turned out to be tractable in a way that most agent categories are not.

Why the on-call surface is a good agent fit

Software operations is full of work that looks like agentic reasoning but is actually structured: an alert fires, you check the last deploy, you look at the affected service’s metrics, you tail the logs, you cross-reference with related services, you form a hypothesis, you check the runbook, you take an action. The decision tree is familiar; the variance is in which leaf you end up at.

This is exactly the shape that LLMs are good at when they’re given the right tools. Each step in the on-call decision tree is a tool call: read deploys, query Prometheus, search logs in Datadog, check the runbook in Notion. The LLM’s job is to traverse the tree intelligently and summarise the findings. That’s not AGI; that’s “an agent with MCP connectors and a hypothesis-testing prompt loop.”
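A minimal sketch of that loop, assuming three hypothetical read-only tools (read_deploys, query_metrics, search_logs) that return canned rows. The choose_next_tool function is a deterministic stand-in for the model's decision about which tool to call next; none of these names come from any vendor's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


# Hypothetical read-only tools. Each stub stands in for a real connector
# and returns canned rows so the sketch runs without network access.
def read_deploys(service: str) -> list[str]:
    return [f"{service}: feature flag checkout-v2 flipped 22 minutes ago"]


def query_metrics(service: str) -> list[str]:
    return [f"{service}: latency rose from 240ms to 4.8s"]


def search_logs(service: str) -> list[str]:
    return [f"{service}: repeated reads against the inventory cache"]


TOOLS: dict[str, Callable[[str], list[str]]] = {
    "read_deploys": read_deploys,
    "query_metrics": query_metrics,
    "search_logs": search_logs,
}


@dataclass
class Investigation:
    service: str
    findings: dict[str, list[str]] = field(default_factory=dict)


def choose_next_tool(inv: Investigation) -> Optional[str]:
    """Stand-in for the model: pick the first tool not yet consulted."""
    for name in TOOLS:
        if name not in inv.findings:
            return name
    return None  # nothing left to check, so stop


def investigate(service: str, max_steps: int = 10) -> Investigation:
    inv = Investigation(service)
    for _ in range(max_steps):  # hard cap so the loop always terminates
        tool = choose_next_tool(inv)
        if tool is None:
            break
        inv.findings[tool] = TOOLS[tool](service)
    return inv


def summarise(inv: Investigation) -> str:
    lines = [f"Investigation for {inv.service}:"]
    for tool, rows in inv.findings.items():
        lines.extend(f"- [{tool}] {row}" for row in rows)
    return "\n".join(lines)
```

Calling summarise(investigate("checkout")) yields one evidence line per tool. In a real agent the interesting part is choose_next_tool, which would be a model call conditioned on the findings so far rather than a fixed order.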

Three other properties of the on-call domain make it a particularly nice agent target:

The tools are well-defined. Every observability vendor has an API. Datadog, Grafana, Prometheus, Splunk, New Relic, Honeycomb, Loki — all of them speak in queries that return rows. The model doesn’t need to invent the tool surface; it just needs to use it.
The ground truth is observable. Did the page resolve? Did the error rate come back down? Did the deploy that was reverted fix the thing? Each of these is a measurable signal that closes the loop.
The customer is happy to limit scope. Unlike “let the agent edit my code,” “let the agent read my logs” is a permission most SRE teams will grant without a year of legal review. Read-only is the default and it covers 80% of the value.

The category is sometimes called AIOps reimagined as agents, but the more honest framing is that it’s the first agent category where read-only access plus a reasoning loop produces measurable business value without needing the agent to take any risky action.

The architecture, condensed

All three of the leading AI SRE startups — and the in-house systems at companies like Honeycomb and Fly.io — have converged on roughly the same shape. The differences are in the details and the marketing, not the architecture.

Figure: The converged architecture. The MCP connectors are the moat, the hypothesis loop is the smarts, the Slack handoff is the UX, and the read-only constraint is the safety net.

The pieces, in slightly more detail:

MCP-style connectors to every observability and ops tool the customer uses. These are the agent’s senses. Resolve.ai, Cleric, and Parity all advertise 10+ supported integrations on day one. The list is roughly the same: Datadog, Grafana, Prometheus, Splunk, New Relic, Honeycomb, ELK/Loki, Kubernetes, GitHub, Jira, plus whatever PagerDuty/Opsgenie/incident.io your team is on.
A planner / orchestrator that takes the alert payload and forms a structured set of hypotheses to test. This is the most model-dependent piece; the quality of the hypotheses determines the quality of the investigation.
Parallel hypothesis testing. Each hypothesis is a small chain of tool calls — “if this is a deploy regression, the relevant signals are X, Y, Z; query each and rank evidence.” Resolve.ai’s May 2026 expansion announcement highlighted their new “investigation architecture” as a coordinated team of specialised agents investigating multiple hypotheses in parallel, reporting a 2× improvement in root-cause accuracy from the change.
A handoff into the human workflow. Almost always Slack, sometimes PagerDuty’s native UI, sometimes both. The structured output — ranked hypotheses, evidence summary, suggested remediation — is the product. The engineer’s role is to read it, override the ranking if needed, and authorise any action.
Read-only by default. This is the most important design choice. Cleric is explicit about being read-only; Parity says its agents exclusively execute read-only commands; Resolve.ai allows optional write actions behind an engineer approval gate. The posture is “we will tell you what to do, we will not do it.” This is the right default and probably will remain the right default for several more years.
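The planner, parallel testing, and handoff pieces above can be sketched together. This is an illustration under stated assumptions, not any vendor's implementation: the hypothesis names, the signal names, and the canned OBSERVED table are all hypothetical, and the score is a simple fraction of supporting signals rather than a real evidence model.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    signals: tuple[str, ...]  # read-only checks that would support it


@dataclass(frozen=True)
class Finding:
    hypothesis: str
    score: float  # fraction of supporting signals observed
    evidence: tuple[str, ...]


# Canned observations standing in for connector queries (hypothetical).
OBSERVED = {
    "deploy_in_last_hour": True,
    "latency_step_change": True,
    "redis_cpu_spike": True,
    "node_memory_pressure": False,
    "upstream_5xx": False,
}


def check_signal(signal: str) -> bool:
    """Stand-in for one read-only query against an observability tool."""
    return OBSERVED.get(signal, False)


def evaluate(h: Hypothesis) -> Finding:
    hits = tuple(s for s in h.signals if check_signal(s))
    score = len(hits) / len(h.signals) if h.signals else 0.0
    return Finding(h.name, score, hits)


def investigate(hypotheses: list[Hypothesis]) -> list[Finding]:
    # Hypotheses are independent, so they can be tested concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(evaluate, hypotheses))
    return sorted(findings, key=lambda f: f.score, reverse=True)


def format_handoff(findings: list[Finding]) -> str:
    """Render the ranked output an engineer would read in chat."""
    lines = ["Ranked hypotheses (read-only investigation):"]
    for rank, f in enumerate(findings, start=1):
        evidence = ", ".join(f.evidence) or "no supporting evidence"
        lines.append(f"{rank}. {f.hypothesis} (score {f.score:.2f}): {evidence}")
    return "\n".join(lines)


HYPOTHESES = [
    Hypothesis("upstream_outage", ("upstream_5xx",)),
    Hypothesis("deploy_regression",
               ("deploy_in_last_hour", "latency_step_change", "redis_cpu_spike")),
    Hypothesis("resource_exhaustion", ("node_memory_pressure", "redis_cpu_spike")),
]
```

With the canned data, format_handoff(investigate(HYPOTHESES)) ranks deploy_regression first. Nothing in the sketch can mutate the system under investigation, which is the point of the read-only default: the output is a ranked list for a human to act on.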

What the numbers actually say

The category is young enough that the published numbers come mostly from vendor case studies, but the patterns are consistent enough to believe. The MTTR improvements cluster in the 20-60% range on “routine” incidents, defined roughly as “the alert has fired before and the root cause is in a known set.”

Cleric reports a 20% average MTTR improvement and an average time-to-root-cause of about 5 minutes across its customer base. Their public positioning is “read-only, self-learning investigation with Gartner validation” — they explicitly decline to do remediation.
Resolve.ai reports that their always-on agents typically triage every on-call alert within 5 minutes before an engineer is involved, and that engineers opening their dashboard at the start of an incident see verified findings rather than starting from scratch. Customers like Loom and Snowflake are publicly cited.
Parity is the most opinionated about Kubernetes-specific runbook automation. Their architecture is unusual: instead of one large agent, each step in a runbook is its own small agent, with a separate evaluator agent deciding which step to run next. The result is a runbook system that supports cycles, retries, and conditional branching — closer to a finite state machine where each state is LLM-driven.
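The step-agent-plus-evaluator pattern described for Parity can be sketched as a small state machine. This is a guess at the general shape, not Parity's code: the step names are hypothetical, each step is a plain function where a real system would use a small LLM-driven agent, and the evaluator is a hand-written rule standing in for the evaluator agent.

```python
from __future__ import annotations

from typing import Callable, Optional

State = dict  # shared scratchpad passed from step to step


def check_pod_status(state: State) -> State:
    state["attempts"] = state.get("attempts", 0) + 1
    # Canned behaviour: the pod only reports ready on the third check.
    state["pod_ready"] = state["attempts"] >= 3
    return state


def fetch_pod_logs(state: State) -> State:
    state["logs"] = "OOMKilled"  # canned log excerpt
    return state


def summarise(state: State) -> State:
    state["summary"] = (
        f"pod_ready={state['pod_ready']} after {state['attempts']} checks; "
        f"last logs: {state.get('logs')}"
    )
    return state


STEPS: dict[str, Callable[[State], State]] = {
    "check_pod_status": check_pod_status,
    "fetch_pod_logs": fetch_pod_logs,
    "summarise": summarise,
}


def evaluator(last_step: Optional[str], state: State) -> Optional[str]:
    """Stand-in for the evaluator agent: choose the next step, or None to stop."""
    if last_step is None:
        return "check_pod_status"
    if last_step == "check_pod_status":
        # Conditional branch on what the previous step found.
        return "summarise" if state["pod_ready"] else "fetch_pod_logs"
    if last_step == "fetch_pod_logs":
        return "check_pod_status"  # cycle back and retry the check
    return None


def run_runbook(max_transitions: int = 20) -> tuple[list[str], State]:
    state: State = {}
    trace: list[str] = []
    step = evaluator(None, state)
    while step is not None and len(trace) < max_transitions:
        state = STEPS[step](state)
        trace.append(step)
        step = evaluator(step, state)
    return trace, state
```

Running run_runbook() loops between the status check and the log fetch until the pod reports ready, then summarises. The max_transitions cap matters once cycles are allowed, since an evaluator that never returns None would otherwise loop forever.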

The pattern in industry reports more broadly: AI-powered incident response can cut MTTR by up to 60% in well-instrumented SRE teams, but the gains come almost entirely from compressing the investigation phase, not from automating remediation. The agent isn’t fixing your prod; it’s saving the engineer 40 minutes of kubectl describe and Datadog tab juggling.

Where the pattern breaks: novel incidents

The honest limit on every shipped AI SRE today is the same one: they struggle with incidents that don’t look like anything that’s happened before.

The classic worst case is a cascading failure across three services with a non-obvious chain: your image-processing service is slow because a third-party dependency in your auth service is timing out because the auth service’s database is being throttled because someone shipped a query change in an unrelated service that’s now hammering the shared connection pool. No alert catches the chain end-to-end. No runbook covers it. The model can find pieces of evidence but cannot synthesise the story.

The teams who deploy AI SREs successfully have learned to set their expectations accordingly. The agent is great for:

Recurring alert classes with a known remediation. Latency spike after deploy → check the deploy diff → revert. Memory leak in service X → check the canary → bounce the pod.
First-line triage, where the agent’s job is to gather evidence while the engineer is making coffee. The engineer arrives at the console with most of the data already collected.
Post-incident summaries, where the model is reading logs and writing a structured retro. This is unglamorous and high-value.

The agent is bad for:

First-time-ever incidents with no precedent in the team’s history. The agent will gamely produce hypotheses; the hypotheses will be wrong.
Cross-team failures where the data lives in observability stacks the agent doesn’t have access to.
Anything requiring product judgement (“is this important enough to wake up the VP?”), which is a call the agent shouldn’t be asked to make.

The teams that ship the most reliable on-call workflows have an explicit human-in-the-loop policy at the moment the agent admits it’s out of its depth. Resolve.ai calls this an “escalation policy”; Cleric calls it “confidence-based handoff”; the upshot is the same. When the agent’s confidence in its top hypothesis drops below a threshold, it escalates immediately rather than burning more time investigating.
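A threshold rule like this is simple to state in code. The sketch below assumes the agent attaches a confidence between 0 and 1 to each hypothesis; the 0.6 threshold and the action names are illustrative assumptions, not values published by Resolve.ai or Cleric.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RankedHypothesis:
    name: str
    confidence: float  # assumed to be in [0, 1]


def next_action(ranked: list[RankedHypothesis], threshold: float = 0.6) -> str:
    """Hand off with findings if the top hypothesis is confident enough,
    otherwise escalate to a human straight away."""
    if not ranked:
        return "escalate"  # no hypotheses at all counts as out of depth
    top = max(ranked, key=lambda h: h.confidence)
    return "handoff_with_findings" if top.confidence >= threshold else "escalate"
```

For example, a top hypothesis at 0.82 produces a handoff, while a field of hypotheses all below 0.6 escalates immediately. Where the threshold sits is a policy choice for the team, and a raw model-reported confidence would need calibrating against past incidents before it could be trusted this way.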

The PagerDuty response

The incumbent in this market is, of course, PagerDuty, which has spent more than fifteen years building the workflow layer that every team uses on top of their observability stack. PagerDuty’s response to the AI SRE entrants has been to ship Copilot, an SRE Agent, and an Insights Agent of their own, integrated tightly with their existing on-call schedules and escalation policies.

The interesting strategic question is which side wins. The PagerDuty bet is that the workflow integration is the moat: if every on-call engineer already lives in PagerDuty, putting the AI agent there beats putting it in a separate tool. The Resolve / Cleric / Parity bet is that the investigation quality is the moat: a purpose-built agent will outperform a workflow product’s bolt-on AI, and the customer will adopt the better investigation tool even if it sits outside their existing workflow.

PagerDuty has also made a smart move on the integration front: they shipped an MCP plugin for Cursor and an Incident Responder Custom Agent for GitHub. The play is to be the on-call data layer for whatever agent the developer is using, rather than trying to own the agent itself. This is the classic incumbent move — sell the data and the workflow, let the new entrants experiment with the model layer above you.

By mid-2026 the market looks bimodal. Mature enterprise teams with heavy PagerDuty contracts are mostly going with PagerDuty’s own agents because the integration cost is zero. Newer teams and mid-market customers are picking Resolve, Cleric, or Parity because the AI quality is meaningfully better. The next year will tell whether PagerDuty closes the AI quality gap faster than the startups close the workflow integration gap.

Where this category goes next

Three things I’d expect over the next 12-18 months, based on the trajectory of the shipped products and the public roadmaps:

Write actions behind tight guardrails. Every vendor will start adding gated remediation actions (restart the pod, scale the deployment, revert the deploy) behind explicit engineer approval. Resolve.ai has already been moving in this direction with their always-on agents. The constraints are going to be regulatory and security-related, not technical.
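One way to picture the approval gate is as a policy function that sits in front of every tool call. The action names and the three-way decision below are assumptions made for illustration; a production gate would also need audit logging and a check that the approver is actually authorised.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical action catalogue: reads run freely, writes need a named approver.
READ_ACTIONS = {"get_pod_logs", "query_metrics", "list_deploys"}
WRITE_ACTIONS = {"restart_pod", "scale_deployment", "revert_deploy"}


@dataclass(frozen=True)
class ActionRequest:
    name: str
    target: str
    approved_by: Optional[str] = None  # engineer who signed off, if any


def decide(req: ActionRequest) -> str:
    """Return 'run', 'needs_approval', or 'deny' for a requested action."""
    if req.name in READ_ACTIONS:
        return "run"
    if req.name in WRITE_ACTIONS:
        return "run" if req.approved_by else "needs_approval"
    return "deny"  # anything not explicitly listed is refused
```

A request such as ActionRequest("revert_deploy", "checkout") comes back as needs_approval until an engineer's name is attached. Denying unlisted actions by default keeps the gate safe when a new tool is added to the agent before anyone has classified it.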

Cross-incident memory and pattern learning. The current generation of agents mostly investigates each incident in isolation. The next generation will recognise “this is the third time we’ve seen this pattern; here’s what worked last time,” using a vector store of past incident summaries as additional context. Cleric is already positioning around this with their “continuous learning” framing.
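The retrieval step can be sketched without any external service. To stay self-contained, the example swaps a learned embedding model and vector store for a bag-of-words vector and cosine similarity; the incident IDs and summaries are invented for illustration.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased token counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical store of past incident summaries.
PAST_INCIDENTS = [
    ("INC-1", "checkout latency spike after feature flag enabled, "
              "n+1 reads on inventory cache, reverted flag"),
    ("INC-2", "memory leak in image service, restarted pods"),
    ("INC-3", "auth database throttled by shared connection pool exhaustion"),
]


def recall(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Return up to k past incidents with non-zero similarity, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(summary)), inc_id) for inc_id, summary in PAST_INCIDENTS]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(reverse=True)
    return [(inc_id, score) for score, inc_id in scored[:k]]
```

A query such as recall("latency spike on checkout after flag flip") surfaces the earlier checkout incident, whose summary could then be added to the agent's context. Token overlap misses paraphrases that a real embedding model would catch, so this only shows the shape of the lookup.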

Tighter integration with the deploy and change pipeline. The most common root cause of incidents is “we deployed a thing.” The natural extension is for the AI SRE to also be the AI release engineer, correlating alerts to specific PRs and feature flags from the moment of deploy. PagerDuty’s GitHub integration is the early sign of this.
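The core of that correlation is a time-window join between an alert and the change log. The sketch below assumes a flat list of change events with a service name and timestamp; the Change record, the one-hour lookback, and the sample references are illustrative, with timestamps loosely modelled on the opening example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str     # "deploy" or "flag"
    ref: str      # PR or flag identifier (hypothetical values below)
    service: str
    at: datetime


def candidate_changes(
    alert_time: datetime,
    service: str,
    changes: list[Change],
    lookback: timedelta = timedelta(hours=1),
) -> list[Change]:
    """Changes to the alerting service inside the lookback window, newest first."""
    window_start = alert_time - lookback
    hits = [
        c for c in changes
        if c.service == service and window_start <= c.at <= alert_time
    ]
    return sorted(hits, key=lambda c: c.at, reverse=True)


def minutes_before(alert_time: datetime, change: Change) -> int:
    return int((alert_time - change.at).total_seconds() // 60)


ALERT = datetime(2026, 1, 1, 3, 42)  # illustrative date
CHANGES = [
    Change("deploy", "example-pr-a", "checkout", datetime(2026, 1, 1, 1, 5)),
    Change("flag", "checkout-v2", "checkout", datetime(2026, 1, 1, 3, 20)),
    Change("deploy", "example-pr-b", "inventory", datetime(2026, 1, 1, 3, 30)),
]
```

Here candidate_changes(ALERT, "checkout", CHANGES) returns only the checkout-v2 flag flip, 22 minutes before the alert. A real correlator would also walk the service dependency graph, since a change in a downstream service can be the actual cause.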

The thing not to expect is the autonomous on-call agent that resolves every incident without a human. That is a long way off, and might never arrive, because the cost of getting it wrong on a high-severity incident is catastrophic. The valuable thing is the agent that makes the human’s job 90% easier, and that is the version shipping today.

What to take away

AI SREs work because the on-call workflow was a hidden structured task. The tool surface is well-defined, the ground truth is observable, and read-only access is sufficient for most of the value. This is the most agent-friendly enterprise domain by structure, and the numbers reflect it.
The architecture has converged. MCP connectors, parallel hypothesis testing, Slack handoff, read-only default. The differences between Resolve, Cleric, and Parity are real but small; the category-level pattern is locked.
Mature teams are buying for MTTR; new teams are buying for staffing. A 50-engineer SRE shop adopts an AI SRE because it compresses incident time; a 5-engineer team adopts it because they can’t afford a 24/7 rotation. Both buying motions are real, and they suggest the category gets bigger from both ends.
The incumbent is the workflow, not the model. PagerDuty doesn’t need the best AI SRE; it needs an AI SRE that’s good enough and built into the on-call rotation the customer already uses. That’s a strong position. Whether it’s a winning position depends on whether the model quality gap closes.

The category is the cleanest example of the post-hype agent era — a narrow, well-scoped problem, a real customer with a real budget, an honest set of capabilities, and a measurable result. The AGI demos get the press; this is the work that actually pays the bill.

Further reading: Resolve.ai’s investigation-architecture announcement covers the multi-agent hypothesis-testing pattern in useful detail. Parity’s HN launch thread is a frank discussion of the Kubernetes-runbook architecture, including the limits. PagerDuty’s H2 2025 product launch is the clearest signal of the incumbent’s response. For a third-party view of the category, Metoro’s top 17 AI SRE tools is a useful directory.