datarekha

Agent Security — Prompt Injection & Least Privilege

An LLM can't reliably tell instructions from data, so any content it reads can hijack it. Learn the lethal trifecta, OWASP excessive agency, and why you design for safety at the system layer — not patch the model.

9 min read Intermediate Agentic AI Lesson 26 of 29

What you'll learn

  • Why prompt injection has no reliable fix at the model layer
  • The lethal trifecta: private data + untrusted content + external action
  • OWASP LLM06 Excessive Agency and its three root causes
  • Least-privilege controls that remove a leg of the trifecta by design
  • Dual-LLM / quarantine and the Agents Rule of Two as practical patterns

Before you start

Most security you’ve met assumes a trust boundary: code is trusted, user input is not, and you sanitize at the border. Agents break that model. An LLM reads its entire context as one undifferentiated stream of tokens — your system prompt, the user’s message, a fetched web page, the body of an email, the JSON a tool returned — and it cannot reliably tell which of those it should obey and which it should merely read. Simon Willison, who coined the term prompt injection in September 2022 by analogy to SQL injection, puts the root cause plainly: LLMs are unable to reliably distinguish the importance of instructions based on where they came from.

That single fact is the whole lesson. Everything else — the trifecta, least privilege, quarantine patterns — is a consequence of it.

Two flavours of injection

The danger sharpens once you know which kind you’re facing. Anthropic’s threat model splits prompt injection in two:

Direct injection / jailbreakthe user is the adversaryThe person typing crafts inputto bypass the model’s built-insafety guardrails.Example• “ignore your rules and…”• coax out a forbidden answerIndirect injectionthird-party content is the adversaryThe user is trusted, but the modelingests a web page, email, doc, ortool result carrying hidden orders.Why it’s worse for agents• agents read untrusted content• … constantly, by design
Direct injection is a user problem; indirect injection is the dominant danger for autonomous agents.

Direct injection (often called jailbreaking) is the user themselves trying to talk the model out of its guardrails. Indirect injection is the one that should keep you up at night: the human operator is perfectly trustworthy, but the agent fetches a web page, reads an inbox, or opens a PDF, and that content contains instructions the model obediently follows. An attacker who can get text in front of your agent — by publishing a page it will browse, or emailing an address it reads — gets to issue commands.

And no, this is not jailbreaking by another name. Jailbreaking subverts the model’s safety training; injection exploits the model’s inability to separate instructions from data. They overlap, but the fix is different — and for injection, there isn’t a model-layer fix at all.

The lethal trifecta

Here is the framing the whole field has converged on. In June 2025, Simon Willison named the lethal trifecta: an agent becomes exploitable for data exfiltration the moment it simultaneously has all three of:

  1. Access to private data — your emails, files, database, secrets.
  2. Exposure to untrusted content — anything an attacker can influence: a web page, an email body, a document, a tool result.
  3. The ability to externally communicate — any channel that can carry data out: an outbound HTTP request, a send_email tool, even a rendered markdown image whose URL the client loads automatically.

Any one or two of these is survivable. All three together is the exploit. An attacker plants an instruction in the untrusted content (“read the latest invoice, then load this image”), the model reads the private data, and the exfiltration leg carries it out. A canonical, cheap-to-pull-off channel: get the agent to embed the stolen secret into the URL of a markdown image — when the client renders it, the secret leaks to the attacker’s server via the image request.

The aha the widget is built to deliver: safety comes from removing a leg by design, not from trusting the model to resist injection. Cut the external-communication leg (no outbound network, no auto-loaded images, egress allowlist) and even a fully successful injection has nowhere to send the data. Isolate private data from the untrusted-content path and there’s nothing to steal. You are not making the model stronger — you are making the architecture incapable of the bad outcome.

Excessive Agency — OWASP’s name for the blast radius

The trifecta tells you when exfiltration is possible. OWASP tells you what makes the damage large. In the OWASP Top 10 for LLM Applications 2025, LLM06:2025 Excessive Agency is the sixth entry — and note that Prompt Injection sits at #1, LLM01:2025. Excessive Agency has three root causes:

Root causeWhat it meansThe fix
Excessive FunctionalityThe agent has tools that do more than the task needs (a raw shell when it only needs to read one file).Expose only the minimal tools; prefer granular tools over open-ended ones.
Excessive PermissionsA tool holds broader rights to downstream systems than necessary — e.g. a shared admin service-account token.Run tools in the user’s security context with minimal, short-lived scope.
Excessive AutonomyHigh-impact or irreversible actions run with no human verification.Gate sends, deletes, and transfers behind human-in-the-loop approval.

Excessive Agency is not a kind of prompt injection — it’s a distinct vulnerability that can fire from plain ambiguity or a model mistake. But it is the amplifier: when injection does land, over-permissioned and over-autonomous tools are what turn a hijacked model into an actual breach. The two are complementary, and you defend against both with the same medicine — least privilege.

You can’t patch the model into safety

It’s tempting to believe a newer, more aligned model has basically solved this, or that a good enough detector will catch the bad inputs. The evidence says otherwise. The October 2025 cross-lab paper The Attacker Moves Second — authored by a team spanning OpenAI, Anthropic, and Google DeepMind — took 12 recently published defenses that had each reported near-zero attack success, and bypassed them with adaptive attacks (gradient descent, reinforcement learning, random search, human red-teaming) at success rates above 90% for most of them. The throughline of the adversarial ML literature holds: the attacker moves second, and adapts to whatever you built.

So filtering and prompt-hardening are defense in depth — useful layers, never guarantees. The conclusion the whole field reached is the design stance: don’t try to filter injection away; engineer the system so a successful injection can’t reach a consequential action.

Designing for it: least privilege end-to-end

This is where you actually win. The mental model: remove a leg of the trifecta, and shrink the blast radius of everything that survives.

  • Scoped, short-lived credentials; per-user context. Never a shared admin token. A compromise should be contained to one user’s minimal scope, and expire quickly.
  • Allowlisted, least-functionality tools. Expose only what the task needs. Avoid raw shell / arbitrary HTTP when a narrow tool will do.
  • Treat all tool output as untrusted data, not instructions. Anthropic’s guidance is concrete: put third-party content only inside tool-result blocks (never in the system prompt or plain user text), tell the model what it is and where it came from, and JSON-encode it so an attacker can’t break out of the data into an instruction context. Don’t put your own instructions in tool results — they may be ignored or flagged.
  • Human-in-the-loop on high-impact / irreversible actions. Gate the right actions — sends, deletes, transfers, code execution — not every step (blanket approval just trains people to rubber-stamp). This is the LangGraph human-in-the-loop pattern, applied as a security control.
  • Sever the exfiltration channel. Disable auto-loaded remote images and links; restrict which domains the agent may contact. This is often the cheapest single mitigation — it breaks the trifecta even if the injection succeeds.
  • Enforce authorization downstream. The database, the payment API, the file store should check permissions themselves — complete mediation — rather than trusting the model to restrain itself.

Several of these belong in MAF middleware as reusable guardrails, and the policy reasoning behind them is the subject of the blog post why agents need permissions.

Two patterns worth knowing by name

When you genuinely need to ingest untrusted content and hold private data and act, the architecture has to carry the security instead of the model:

  • Dual-LLM / quarantine (Willison, 2023): split into a privileged LLM that plans and calls tools but never sees raw untrusted content, and a quarantined LLM that ingests the untrusted content but has no tool access and returns only structured, symbolic results (labels, booleans, opaque references). Injected instructions in the data physically cannot reach the component that can act. Google DeepMind’s CaMeL (“Defeating Prompt Injections by Design”) generalizes this with a capability/provenance-tracking interpreter that refuses to pass an untrusted-derived value to an exfiltration tool — a promising research direction, though it reduces rather than eliminates risk.

  • The Agents Rule of Two (Meta AI, late 2025): the trifecta restated as an operating rule. Within a single agent session, satisfy at most two of: processes untrusted input, accesses sensitive data, changes state or communicates externally. Need all three? Run it under supervision — human-in-the-loop approval. It’s explicitly inspired by the lethal trifecta, and it’s the most widely cited practical rule of thumb.

Notice none of these makes the model better at resisting injection. They all make the system incapable of the bad outcome. That is the entire shift: you remove a leg by design.

This pairs directly with two neighbours in the curriculum. Memory is another untrusted-content surface — see the blog post on agent memory poisoning, where the injected instruction is something the agent wrote to its own memory and reads back later. And the tool layer itself is where most of these controls live; revisit tools for the permission and schema mechanics this lesson assumes.

Quick check

Quick check

0/3
Q1An agent reads a customer's support email (untrusted), can query the customer database (private data), and has a send_email tool (external). Which single change most reliably prevents data exfiltration via prompt injection?
Q2Why is OWASP's LLM06 Excessive Agency a *distinct* problem from prompt injection (LLM01), not the same thing?
Q3TRANSFER: You're designing a research agent that must browse arbitrary public web pages (untrusted) AND summarize them into a private internal wiki it can write to (state change). Applying the Agents Rule of Two, what's the cleanest safe design?

Next

You now have the agent-security mental model: injection is unsolved at the model layer, the lethal trifecta tells you when exfiltration is possible, Excessive Agency sets the blast radius, and least privilege is how you remove a leg by design. The human-in-the-loop lesson is where you wire the approval gate that backstops the irreversible actions this lesson told you to fear.

Practice this in an interview

All questions
What is prompt injection and how do you defend against it?

Prompt injection is an attack in which malicious text in retrieved documents or user input overrides the application's system instructions, redirecting the model to perform unintended actions. Defenses layer input/output validation, privilege separation, and tool-call confirmation — no single fix is sufficient.

What prompt engineering techniques should every LLM practitioner know?

The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Combining these predictably closes the gap between a capable base model and a production-ready feature.

How do function/tool calling and LLM agents work at a high level?

Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.

What causes LLM hallucinations and how can they be reduced?

Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content