Tool-output prompt injection in agent systems

Tool-output prompt injection is fundamental to agent systems. If the model reads untrusted text, that text can attempt to steer the agent.

This post is intentionally written as a field note, not a product announcement. The agentic AI landscape is noisy in 2026, but the durable movement is clear: serious teams are shifting from “the model replied” to “the system completed a governed task with evidence, tools, state, and review.”

That distinction is the reason this article uses the language of systems, not hype. A useful AI agent is not a prompt with ambition. It is a bounded worker with inputs, state, tools, policies, recovery paths, and a way to prove that it behaved correctly. The learner’s job is to see the layers. The builder’s job is to make those layers inspectable.

What changed

Web pages, tickets, documents, emails, PDFs, and code comments can all contain malicious or accidental instructions. The defense is not one magic prompt; it is layered boundaries between evidence and control.

The practical implication is that agent design is now systems design. A model call is one component. Around it sit context construction, tool contracts, memory policy, orchestration, permissioning, evals, observability, and deployment. A shallow agent demo hides those layers; a production agent makes them explicit.

The recent agentic AI development pattern is therefore less about making the model “more autonomous” and more about deciding where autonomy is allowed. Retrieval can be autonomous while writes require approval. Tool selection can be autonomous while payment or deletion remains gated. A planner can decompose work while a workflow engine owns retries and deadlines. That division is where reliable agent products begin.

Defense is not one magic prompt; it is a boundary that keeps tool output as data and never lets it become control.

The architecture consequence

Separate system instructions from evidence, strip or quote untrusted content, minimize tool permissions, require approval for risky actions, and test injection examples in evals.

For learners on datarekha, the most useful habit is to draw the run as a state machine. What does the agent know before the action? What tool is available? What evidence returns? What makes the system stop? What requires human review? The answers matter more than the framework name.

When you compare frameworks such as Agents SDK, MCP-based tool servers, A2A service boundaries, ADK, LangGraph, or Microsoft Agent Framework, use the same evaluation grid each time:

What is the durable run object?
Where are tools declared and permissioned?
How does the system pause for human review?
How are intermediate states replayed after failure?
What evidence proves the final answer is grounded?

This grid keeps the conversation technical. It also protects you from the common mistake of treating a framework feature as an architecture. A framework can make good architecture easier, but it cannot decide your risk boundary, data boundary, or user trust boundary for you.

The failure mode to watch

Many systems tell the model to ignore malicious instructions but still expose powerful tools and secrets in the same context.

This is why the best recent agent work looks less like “invent a clever prompt” and more like ordinary engineering discipline: narrow interfaces, clear ownership, tests, metrics, logs, and rollback. The model is powerful, but the surrounding harness decides whether that power becomes a reliable product.

The deeper failure is usually not that the model “hallucinated.” The deeper failure is that the system had no way to know whether the model was allowed to act, whether the retrieved evidence was strong enough, whether the tool result contradicted the plan, or whether a human should have been asked. Once you name the failure this way, fixes become more concrete: stronger contracts, narrower tools, better evals, clearer trace spans, and explicit stop conditions.

For SEO crawlers, LLM crawlers, and human readers alike, the durable keywords are not decorative: agentic AI, AI agents, agent orchestration, MCP, A2A, tool calling, human-in-the-loop, agent evals, observability, memory, retrieval, sandboxing, and production AI reliability. They are the vocabulary of the system. Use them precisely, because precise language produces precise architecture.

A useful mental model

Think of every agent run as a controlled experiment. The user gives intent. The system builds context. The agent chooses an action. The environment returns evidence. The system checks whether the next action is allowed. The run either continues, pauses, or stops. If you cannot point to the artifact for each step, the agent is not yet production-grade.

This model also explains why long-horizon work needs more than a chat transcript. Research agents need source state. Coding agents need diffs, tests, and repo facts. Analytics agents need query plans and metric definitions. Customer-service agents need policy citations and approval records. Without those artifacts, the final answer may sound good while the process remains untrustworthy.

Review checklist

Before you ship anything in this area, ask these questions:

Can the agent explain which evidence changed its decision?
Can a human replay the run without guessing hidden state?
Are risky tools separated from read-only tools?
Are tool outputs treated as untrusted input?
Does the eval suite test both success and refusal?
Does the system measure completed-task cost, not only model-call cost?

If one of those answers is weak, improve the harness before adding another capability. Capability without control is the easiest way to create an impressive demo and a fragile product.

How to use this as a builder

If you are building an agent today, write these five lines before code:

The task the agent owns.
The tools it may call.
The evidence it must collect before acting.
The actions that require approval.
The eval that proves the behavior did not regress.

That small checklist prevents most architecture mistakes. Then learn the implementation mechanics in datarekha’s Agentic AI section, especially tool use, MCP, LangGraph, MAF, and ADK.

The best way to study this topic on datarekha is to move from concept to implementation: first understand what an agent is, then tool calling, then MCP, then workflow graphs, then evals and human approval. Do not rush to multi-agent systems until a single-agent run is observable, permissioned, and testable. A reliable single agent teaches more than a noisy swarm.

Sources and further reading

Anthropic engineering: https://www.anthropic.com/engineering/building-effective-agents
Anthropic engineering: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
datarekha: /agentic-ai/mcp-intro/