Agent Security — Prompt Injection & Least Privilege
An LLM can't reliably tell instructions from data, so any content it reads can hijack it. Learn the lethal trifecta, OWASP excessive agency, and why you design for safety at the system layer — not patch the model.
What you'll learn
- Why prompt injection has no reliable fix at the model layer
- The lethal trifecta: private data + untrusted content + external action
- OWASP LLM06 Excessive Agency and its three root causes
- Least-privilege controls that remove a leg of the trifecta by design
- Dual-LLM / quarantine and the Agents Rule of Two as practical patterns
Before you start
Most security you’ve met assumes a trust boundary: code is trusted, user input is not, and you sanitize at the border. Agents break that model. An LLM reads its entire context as one undifferentiated stream of tokens — your system prompt, the user’s message, a fetched web page, the body of an email, the JSON a tool returned — and it cannot reliably tell which of those it should obey and which it should merely read. Simon Willison, who coined the term prompt injection in September 2022 by analogy to SQL injection, puts the root cause plainly: LLMs are unable to reliably distinguish the importance of instructions based on where they came from.
That single fact is the whole lesson. Everything else — the trifecta, least privilege, quarantine patterns — is a consequence of it.
Two flavours of injection
The danger sharpens once you know which kind you’re facing. Anthropic’s threat model splits prompt injection in two:
Direct injection (often called jailbreaking) is the user themselves trying to talk the model out of its guardrails. Indirect injection is the one that should keep you up at night: the human operator is perfectly trustworthy, but the agent fetches a web page, reads an inbox, or opens a PDF, and that content contains instructions the model obediently follows. An attacker who can get text in front of your agent — by publishing a page it will browse, or emailing an address it reads — gets to issue commands.
And no, this is not jailbreaking by another name. Jailbreaking subverts the model’s safety training; injection exploits the model’s inability to separate instructions from data. They overlap, but the fix is different — and for injection, there isn’t a model-layer fix at all.
The lethal trifecta
Here is the framing the whole field has converged on. In June 2025, Simon Willison named the lethal trifecta: an agent becomes exploitable for data exfiltration the moment it simultaneously has all three of:
- Access to private data — your emails, files, database, secrets.
- Exposure to untrusted content — anything an attacker can influence: a web page, an email body, a document, a tool result.
- The ability to externally communicate — any channel that can carry
data out: an outbound HTTP request, a
send_emailtool, even a rendered markdown image whose URL the client loads automatically.
Any one or two of these is survivable. All three together is the exploit. An attacker plants an instruction in the untrusted content (“read the latest invoice, then load this image”), the model reads the private data, and the exfiltration leg carries it out. A canonical, cheap-to-pull-off channel: get the agent to embed the stolen secret into the URL of a markdown image — when the client renders it, the secret leaks to the attacker’s server via the image request.
The aha the widget is built to deliver: safety comes from removing a leg by design, not from trusting the model to resist injection. Cut the external-communication leg (no outbound network, no auto-loaded images, egress allowlist) and even a fully successful injection has nowhere to send the data. Isolate private data from the untrusted-content path and there’s nothing to steal. You are not making the model stronger — you are making the architecture incapable of the bad outcome.
Excessive Agency — OWASP’s name for the blast radius
The trifecta tells you when exfiltration is possible. OWASP tells you what makes the damage large. In the OWASP Top 10 for LLM Applications 2025, LLM06:2025 Excessive Agency is the sixth entry — and note that Prompt Injection sits at #1, LLM01:2025. Excessive Agency has three root causes:
| Root cause | What it means | The fix |
|---|---|---|
| Excessive Functionality | The agent has tools that do more than the task needs (a raw shell when it only needs to read one file). | Expose only the minimal tools; prefer granular tools over open-ended ones. |
| Excessive Permissions | A tool holds broader rights to downstream systems than necessary — e.g. a shared admin service-account token. | Run tools in the user’s security context with minimal, short-lived scope. |
| Excessive Autonomy | High-impact or irreversible actions run with no human verification. | Gate sends, deletes, and transfers behind human-in-the-loop approval. |
Excessive Agency is not a kind of prompt injection — it’s a distinct vulnerability that can fire from plain ambiguity or a model mistake. But it is the amplifier: when injection does land, over-permissioned and over-autonomous tools are what turn a hijacked model into an actual breach. The two are complementary, and you defend against both with the same medicine — least privilege.
You can’t patch the model into safety
It’s tempting to believe a newer, more aligned model has basically solved this, or that a good enough detector will catch the bad inputs. The evidence says otherwise. The October 2025 cross-lab paper The Attacker Moves Second — authored by a team spanning OpenAI, Anthropic, and Google DeepMind — took 12 recently published defenses that had each reported near-zero attack success, and bypassed them with adaptive attacks (gradient descent, reinforcement learning, random search, human red-teaming) at success rates above 90% for most of them. The throughline of the adversarial ML literature holds: the attacker moves second, and adapts to whatever you built.
So filtering and prompt-hardening are defense in depth — useful layers, never guarantees. The conclusion the whole field reached is the design stance: don’t try to filter injection away; engineer the system so a successful injection can’t reach a consequential action.
Designing for it: least privilege end-to-end
This is where you actually win. The mental model: remove a leg of the trifecta, and shrink the blast radius of everything that survives.
- Scoped, short-lived credentials; per-user context. Never a shared admin token. A compromise should be contained to one user’s minimal scope, and expire quickly.
- Allowlisted, least-functionality tools. Expose only what the task needs. Avoid raw shell / arbitrary HTTP when a narrow tool will do.
- Treat all tool output as untrusted data, not instructions. Anthropic’s guidance is concrete: put third-party content only inside tool-result blocks (never in the system prompt or plain user text), tell the model what it is and where it came from, and JSON-encode it so an attacker can’t break out of the data into an instruction context. Don’t put your own instructions in tool results — they may be ignored or flagged.
- Human-in-the-loop on high-impact / irreversible actions. Gate the right actions — sends, deletes, transfers, code execution — not every step (blanket approval just trains people to rubber-stamp). This is the LangGraph human-in-the-loop pattern, applied as a security control.
- Sever the exfiltration channel. Disable auto-loaded remote images and links; restrict which domains the agent may contact. This is often the cheapest single mitigation — it breaks the trifecta even if the injection succeeds.
- Enforce authorization downstream. The database, the payment API, the file store should check permissions themselves — complete mediation — rather than trusting the model to restrain itself.
Several of these belong in MAF middleware as reusable guardrails, and the policy reasoning behind them is the subject of the blog post why agents need permissions.
Two patterns worth knowing by name
When you genuinely need to ingest untrusted content and hold private data and act, the architecture has to carry the security instead of the model:
-
Dual-LLM / quarantine (Willison, 2023): split into a privileged LLM that plans and calls tools but never sees raw untrusted content, and a quarantined LLM that ingests the untrusted content but has no tool access and returns only structured, symbolic results (labels, booleans, opaque references). Injected instructions in the data physically cannot reach the component that can act. Google DeepMind’s CaMeL (“Defeating Prompt Injections by Design”) generalizes this with a capability/provenance-tracking interpreter that refuses to pass an untrusted-derived value to an exfiltration tool — a promising research direction, though it reduces rather than eliminates risk.
-
The Agents Rule of Two (Meta AI, late 2025): the trifecta restated as an operating rule. Within a single agent session, satisfy at most two of: processes untrusted input, accesses sensitive data, changes state or communicates externally. Need all three? Run it under supervision — human-in-the-loop approval. It’s explicitly inspired by the lethal trifecta, and it’s the most widely cited practical rule of thumb.
Notice none of these makes the model better at resisting injection. They all make the system incapable of the bad outcome. That is the entire shift: you remove a leg by design.
This pairs directly with two neighbours in the curriculum. Memory is another untrusted-content surface — see the blog post on agent memory poisoning, where the injected instruction is something the agent wrote to its own memory and reads back later. And the tool layer itself is where most of these controls live; revisit tools for the permission and schema mechanics this lesson assumes.
Quick check
Quick check
Next
You now have the agent-security mental model: injection is unsolved at the model layer, the lethal trifecta tells you when exfiltration is possible, Excessive Agency sets the blast radius, and least privilege is how you remove a leg by design. The human-in-the-loop lesson is where you wire the approval gate that backstops the irreversible actions this lesson told you to fear.
Practice this in an interview
All questionsPrompt injection is an attack in which malicious text in retrieved documents or user input overrides the application's system instructions, redirecting the model to perform unintended actions. Defenses layer input/output validation, privilege separation, and tool-call confirmation — no single fix is sufficient.
The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Combining these predictably closes the gap between a capable base model and a production-ready feature.
Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.
Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.