Agent run state is an API surface

Agent applications should treat run state as a product API: resumable, inspectable, cancelable, and linkable to artifacts.

This post is intentionally written as a field note, not a product announcement. The agentic AI landscape is noisy in 2026, but the durable movement is clear: serious teams are shifting from “the model replied” to “the system completed a governed task with evidence, tools, state, and review.”

That distinction is the reason this article uses the language of systems, not hype. A useful AI agent is not a prompt with ambition. It is a bounded worker with inputs, state, tools, policies, recovery paths, and a way to prove that it behaved correctly. The learner’s job is to see the layers. The builder’s job is to make those layers inspectable.

What changed

A good run object answers practical questions: is the task waiting on a tool, waiting on a human, failed, retrying, complete, or canceled? What artifacts exist? What did it cost? What approvals happened?

Treat the run object as a state machine: every transition is observable, and each state exposes the same fields to product UI, support, and observability.
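To make the state-machine idea concrete, here is a minimal sketch in Python. The status names follow the questions above; the QUEUED and RUNNING states, the transition table, and the Run class are illustrative assumptions, not any framework's API.

```python
from enum import Enum


class RunStatus(str, Enum):
    QUEUED = "queued"            # assumed starting state
    RUNNING = "running"          # assumed active state
    WAITING_ON_TOOL = "waiting_on_tool"
    WAITING_ON_HUMAN = "waiting_on_human"
    RETRYING = "retrying"
    FAILED = "failed"
    COMPLETED = "completed"
    CANCELED = "canceled"


S = RunStatus

# Hypothetical transition table: for each status, the statuses that may follow.
ALLOWED = {
    S.QUEUED: {S.RUNNING, S.CANCELED},
    S.RUNNING: {S.WAITING_ON_TOOL, S.WAITING_ON_HUMAN, S.RETRYING,
                S.FAILED, S.COMPLETED, S.CANCELED},
    S.WAITING_ON_TOOL: {S.RUNNING, S.RETRYING, S.FAILED, S.CANCELED},
    S.WAITING_ON_HUMAN: {S.RUNNING, S.CANCELED},
    S.RETRYING: {S.RUNNING, S.FAILED, S.CANCELED},
    S.FAILED: set(),      # terminal
    S.COMPLETED: set(),   # terminal
    S.CANCELED: set(),    # terminal
}


class Run:
    def __init__(self, run_id):
        self.run_id = run_id
        self.status = S.QUEUED
        self.history = [S.QUEUED]  # every transition stays observable

    def transition(self, new_status):
        # Reject any move the table does not list, so illegal states never appear.
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.status = new_status
        self.history.append(new_status)
```

Keeping the table as data rather than scattered if-statements means product UI, support tooling, and tests can all read the same definition of what a run is allowed to do next.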

The practical implication is that agent design is now systems design. A model call is one component. Around it sit context construction, tool contracts, memory policy, orchestration, permissioning, evals, observability, and deployment. A shallow agent demo hides those layers; a production agent makes them explicit.

The recent agentic AI development pattern is therefore less about making the model “more autonomous” and more about deciding where autonomy is allowed. Retrieval can be autonomous while writes require approval. Tool selection can be autonomous while payment or deletion remains gated. A planner can decompose work while a workflow engine owns retries and deadlines. That division is where reliable agent products begin.
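One way to write that division down is a per-tool autonomy policy. The sketch below is an assumption-laden illustration: the tool names, the POLICY table, and the decide function are hypothetical, and unknown tools are denied by default as a design choice.

```python
from enum import Enum


class Autonomy(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_APPROVAL = "needs_approval"


# Hypothetical tools: reads run freely, writes and payments are gated.
POLICY = {
    "search_docs": Autonomy.AUTONOMOUS,
    "read_record": Autonomy.AUTONOMOUS,
    "update_record": Autonomy.NEEDS_APPROVAL,
    "issue_refund": Autonomy.NEEDS_APPROVAL,
    "delete_record": Autonomy.NEEDS_APPROVAL,
}


def decide(tool_name, approved_by=None):
    """Return "run", "pause", or "deny" for a requested tool call."""
    level = POLICY.get(tool_name)
    if level is None:
        return "deny"   # tools missing from the policy are never callable
    if level is Autonomy.AUTONOMOUS:
        return "run"
    # Gated tools run only once a named approver is attached.
    return "run" if approved_by else "pause"


print(decide("search_docs"))                       # run
print(decide("issue_refund"))                      # pause
print(decide("issue_refund", approved_by="dana"))  # run
print(decide("format_disk"))                       # deny
```

The point of the sketch is that the autonomy boundary lives in a table the team can review, not in the prompt.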

The architecture consequence

Expose run status, current step, required action, artifacts, error, cost, and trace id. Then product UI, support, and observability can all speak the same language.
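As a rough sketch of that shared language, the fields listed above can live in one serializable record. The RunView name, the field types, the USD cost unit, and the sample values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunView:
    run_id: str
    status: str
    current_step: Optional[str] = None
    required_action: Optional[str] = None   # what the run is waiting for, if anything
    artifacts: list = field(default_factory=list)
    error: Optional[str] = None
    cost_usd: float = 0.0                   # assumed unit
    trace_id: Optional[str] = None

    def to_json(self) -> str:
        # One payload shape for product UI, support, and observability.
        return json.dumps(asdict(self), sort_keys=True)


view = RunView(
    run_id="run-42",
    status="waiting_on_human",
    current_step="draft_refund",
    required_action="approve_refund",
    artifacts=["artifacts/refund_draft.json"],
    cost_usd=0.18,
    trace_id="trace-abc123",
)
print(view.to_json())
```

Whatever the real field names end up being, the useful property is that every consumer reads the same record instead of parsing the transcript.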

For learners on datarekha, the most useful habit is to draw the run as a state machine. What does the agent know before the action? What tool is available? What evidence returns? What makes the system stop? What requires human review? The answers matter more than the framework name.

When you compare frameworks such as Agents SDK, MCP-based tool servers, A2A service boundaries, ADK, LangGraph, or Microsoft Agent Framework, use the same evaluation grid each time:


What is the durable run object?
Where are tools declared and permissioned?
How does the system pause for human review?
How are intermediate states replayed after failure?
What evidence proves the final answer is grounded?

This grid keeps the conversation technical. It also protects you from the common mistake of treating a framework feature as an architecture. A framework can make good architecture easier, but it cannot decide your risk boundary, data boundary, or user trust boundary for you.

The failure mode to watch

If state exists only in the model transcript, the product cannot build reliable controls around it.
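A small sketch of the alternative: keep run state in a durable record outside the transcript, so a control such as cancel can act on it without reading any model output. The JSON-file storage, the function names, and the state fields are assumptions; a real system would more likely use a database or a workflow engine.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    # Write to a temporary file first, then swap it in, so a crash
    # mid-write does not leave a half-written checkpoint behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def load_checkpoint(path):
    return json.loads(path.read_text())


def cancel(path):
    # The cancel control touches only the stored state, never the transcript.
    state = load_checkpoint(path)
    state["status"] = "canceled"
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "run-7.json"
    save_checkpoint(checkpoint, {"run_id": "run-7", "status": "waiting_on_tool", "step": 3})
    resumed = load_checkpoint(checkpoint)   # a restarted worker picks up at step 3
    canceled = cancel(checkpoint)
    print(resumed["step"], canceled["status"])
```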

This is why the best recent agent work looks less like “invent a clever prompt” and more like ordinary engineering discipline: narrow interfaces, clear ownership, tests, metrics, logs, and rollback. The model is powerful, but the surrounding harness decides whether that power becomes a reliable product.

The deeper failure is usually not that the model “hallucinated.” The deeper failure is that the system had no way to know whether the model was allowed to act, whether the retrieved evidence was strong enough, whether the tool result contradicted the plan, or whether a human should have been asked. Once you name the failure this way, fixes become more concrete: stronger contracts, narrower tools, better evals, clearer trace spans, and explicit stop conditions.

The recurring terms in this area are not decorative: agentic AI, AI agents, agent orchestration, MCP, A2A, tool calling, human-in-the-loop, agent evals, observability, memory, retrieval, sandboxing, and production AI reliability. They are the vocabulary of the system. Use them precisely, because imprecise language tends to produce imprecise architecture.

A useful mental model

Think of every agent run as a controlled experiment. The user gives intent. The system builds context. The agent chooses an action. The environment returns evidence. The system checks whether the next action is allowed. The run either continues, pauses, or stops. If you cannot point to the artifact for each step, the agent is not yet production-grade.
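The loop described above can be sketched in a few lines. Everything here is a stand-in: choose_action plays the agent, execute plays the environment, is_allowed plays the policy check, and the step budget of five is an arbitrary assumption.

```python
def run_loop(intent, choose_action, execute, is_allowed, max_steps=5):
    """Run until the agent finishes, a gated action appears, or the budget ends."""
    context = {"intent": intent, "evidence": []}
    artifacts = []   # one record per step, so each step can be pointed to later
    for step in range(max_steps):
        action = choose_action(context)
        if action is None:
            return "stopped", artifacts          # nothing left to do
        if not is_allowed(action):
            artifacts.append({"step": step, "action": action, "outcome": "paused"})
            return "paused", artifacts           # wait for review before acting
        evidence = execute(action)
        context["evidence"].append(evidence)
        artifacts.append({"step": step, "action": action, "evidence": evidence})
    return "stopped", artifacts                  # step budget exhausted


# Hypothetical stand-ins for the agent, the environment, and the policy.
def choose_action(context):
    plan = ["search_orders", "issue_refund"]
    done = len(context["evidence"])
    return plan[done] if done < len(plan) else None


def execute(action):
    return f"result of {action}"


def is_allowed(action):
    return action != "issue_refund"   # refunds need a human


status, artifacts = run_loop("refund order 1001", choose_action, execute, is_allowed)
print(status, len(artifacts))
```

In this sketch the run pauses at the refund step with two artifact records: one for the search that executed and one for the action that was held back.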

This model also explains why long-horizon work needs more than a chat transcript. Research agents need source state. Coding agents need diffs, tests, and repo facts. Analytics agents need query plans and metric definitions. Customer-service agents need policy citations and approval records. Without those artifacts, the final answer may sound good while the process remains untrustworthy.

Review checklist

Before you ship anything in this area, ask these questions:

Can the agent explain which evidence changed its decision?
Can a human replay the run without guessing hidden state?
Are risky tools separated from read-only tools?
Are tool outputs treated as untrusted input?
Does the eval suite test both success and refusal?
Does the system measure completed-task cost, not only model-call cost?

If one of those answers is weak, improve the harness before adding another capability. Capability without control is the easiest way to create an impressive demo and a fragile product.
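The last checklist question is easy to quantify. The sketch below assumes each run record carries a status, a model cost, and a tool cost; the field names and the figures are made up for illustration.

```python
def cost_per_completed_task(runs):
    """Total spend across all runs, divided by the number that completed."""
    total = sum(r["model_cost"] + r["tool_cost"] for r in runs)
    completed = sum(1 for r in runs if r["status"] == "completed")
    if completed == 0:
        return None   # no completed task to attribute the spend to
    return total / completed


# Hypothetical run records; the failed run still cost money.
runs = [
    {"status": "completed", "model_cost": 0.25, "tool_cost": 0.25},
    {"status": "failed", "model_cost": 0.5, "tool_cost": 0.0},
    {"status": "completed", "model_cost": 0.25, "tool_cost": 0.25},
]

print(cost_per_completed_task(runs))   # 0.75
```

Here each successful run shows 0.25 of model spend, but the cost per completed task is 0.75 once tool calls and the failed run are counted.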

How to use this as a builder

If you are building an agent today, write these five lines before you write any code:

The task the agent owns.
The tools it may call.
The evidence it must collect before acting.
The actions that require approval.
The eval that proves the behavior did not regress.

That small checklist prevents many common architecture mistakes. Then learn the implementation mechanics in datarekha’s Agentic AI section, especially tool use, MCP, LangGraph, Microsoft Agent Framework (MAF), and ADK.
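The five lines from the checklist can also be written down as a small spec before any agent code exists. This is only a sketch: the AgentSpec name, its fields, the sample values, and the rule that every approval-gated action must be a declared tool are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    task: str                  # the task the agent owns
    tools: tuple               # the tools it may call
    required_evidence: tuple   # the evidence it must collect before acting
    approval_required: tuple   # the actions that require approval
    regression_eval: str       # the eval that proves behavior did not regress

    def validate(self):
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if not self.task.strip():
            problems.append("task is empty")
        undeclared = set(self.approval_required) - set(self.tools)
        if undeclared:
            problems.append(f"approval listed for undeclared tools: {sorted(undeclared)}")
        if not self.regression_eval.strip():
            problems.append("no regression eval named")
        return problems


spec = AgentSpec(
    task="Resolve refund requests under a fixed amount",
    tools=("lookup_order", "issue_refund"),
    required_evidence=("order record", "refund policy clause"),
    approval_required=("issue_refund",),
    regression_eval="evals/refund_regression.yaml",   # hypothetical path
)
print(spec.validate())   # []
```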

The best way to study this topic on datarekha is to move from concept to implementation: first understand what an agent is, then tool calling, then MCP, then workflow graphs, then evals and human approval. Do not rush to multi-agent systems until a single-agent run is observable, permissioned, and testable. A reliable single agent teaches more than a noisy swarm.

Sources and further reading

OpenAI documentation, agents guide: https://developers.openai.com/api/docs/guides/agents
OpenAI documentation, migrating to the Responses API: https://developers.openai.com/api/docs/guides/migrate-to-responses
datarekha: /agentic-ai/langgraph-state/