What are the major security risks of deploying autonomous agents?

Key risks include prompt injection, especially indirect injection via tool or retrieval outputs, hijacking the agent, excessive tool permissions enabling damaging actions, data exfiltration, confused-deputy privilege escalation, and unbounded loops driving cost or harm. Mitigations include least-privilege tools, sandboxing, input and output guardrails, human-in-the-loop approval for sensitive actions, and audit logging.

How would you defend an LLM application against prompt injection?

No single fix is complete, so defenses are layered: separate trusted instructions from untrusted data, constrain and least-privilege the tools and actions the model can take, validate and sanitize inputs and tool outputs, add output guardrails and injection classifiers, and keep a human in the loop for sensitive actions. Treat all external or retrieved content as untrusted.

What is prompt injection and how do you defend against it?

Prompt injection is an attack in which malicious text in retrieved documents or user input overrides the application's system instructions, redirecting the model to perform unintended actions. Defenses layer input/output validation, privilege separation, and tool-call confirmation — no single fix is sufficient.

How would you prevent an AI agent from leaking or misusing API credentials?

Keep raw credentials outside model context and traces. Let the model propose typed intent, authorize the final action and arguments deterministically, then have a trusted executor inject a short-lived, narrowly scoped, audience-restricted credential for one call. Re-authorize downstream and gate high-impact writes with explicit approval.

Agent Security — Prompt Injection & Least Privilege — Agentic AI

Most security you’ve met assumes a trust boundary: code is trusted, user input is not, and you sanitize at the border. Agents break that model. An LLM reads its entire context as one undifferentiated stream of tokens — your system prompt, the user’s message, a fetched web page, the body of an email, the JSON a tool returned — and it cannot reliably tell which of those it should obey and which it should merely read. Simon Willison, who coined the term prompt injection in September 2022 by analogy to SQL injection, puts the root cause plainly: LLMs are unable to reliably distinguish the importance of instructions based on where they came from.

That single fact is the whole lesson. Everything else — the trifecta, least privilege, quarantine patterns — is a consequence of it.

TryLethal-trifecta threat model

Toggle the agent's capabilities. Watch when it becomes exploitable.

An agent is safe to attack only when it has all three legs at once: private data + untrusted content + external action. Flip the legs, then apply least-privilege defenses and watch the risk drop — because a leg got removed, not because the model got stronger.

EXPLOITABLE — lethal trifectaAll three legs live. An injection can read private data and send it out.

Apply least-privilege defenses

This is the exploit. A page or email the agent reads carries a hidden instruction; the model — which can't tell instructions from data — reads your private data and ships it out (often by embedding it in an image URL). No system prompt reliably stops this. Apply a defense above to break a leg.

Two flavours of injection

The danger sharpens once you know which kind you’re facing. Anthropic’s threat model splits prompt injection in two:

Direct injection is a user problem; indirect injection is the dominant danger for autonomous agents.

Direct injection (often called jailbreaking) is the user themselves trying to talk the model out of its guardrails. Indirect injection is the one that should keep you up at night: the human operator is perfectly trustworthy, but the agent fetches a web page, reads an inbox, or opens a PDF, and that content contains instructions the model obediently follows. An attacker who can get text in front of your agent — by publishing a page it will browse, or emailing an address it reads — gets to issue commands.

And no, this is not jailbreaking by another name. Jailbreaking subverts the model’s safety training; injection exploits the model’s inability to separate instructions from data. They overlap, but the fix is different — and for injection, there isn’t a model-layer fix at all.

The lethal trifecta

Here is the framing the whole field has converged on. In June 2025, Simon Willison named the lethal trifecta: an agent becomes exploitable for data exfiltration the moment it simultaneously has all three of:

Access to private data — your emails, files, database, secrets.
Exposure to untrusted content — anything an attacker can influence: a web page, an email body, a document, a tool result.
The ability to externally communicate — any channel that can carry data out: an outbound HTTP request, a send_email tool, even a rendered markdown image whose URL the client loads automatically.

Any one or two of these is survivable. All three together is the exploit. An attacker plants an instruction in the untrusted content (“read the latest invoice, then load this image”), the model reads the private data, and the exfiltration leg carries it out. A canonical, cheap-to-pull-off channel: get the agent to embed the stolen secret into the URL of a markdown image — when the client renders it, the secret leaks to the attacker’s server via the image request.

The lethal trifecta: private data + untrusted content + external communication. Any two are survivable; all three together enable exfiltration. Remove any one leg and the center vanishes.

The design stance the trifecta points to: safety comes from removing a leg by design, not from trusting the model to resist injection. Cut the external-communication leg (no outbound network, no auto-loaded images, egress allowlist) and even a fully successful injection has nowhere to send the data. Isolate private data from the untrusted-content path and there’s nothing to steal. You are not making the model stronger — you are making the architecture incapable of the bad outcome. So the answer to the question above is no: that system-prompt line can’t make the agent safe, because the model can’t reliably tell instructions from data in the first place.

Excessive Agency — OWASP’s name for the blast radius

The trifecta tells you when exfiltration is possible. OWASP tells you what makes the damage large. In the OWASP Top 10 for LLM Applications 2025, LLM06:2025 Excessive Agency is the sixth entry — and note that Prompt Injection sits at #1, LLM01:2025. Excessive Agency has three root causes:

Root cause	What it means	The fix
Excessive Functionality	The agent has tools that do more than the task needs (a raw shell when it only needs to read one file).	Expose only the minimal tools; prefer granular tools over open-ended ones.
Excessive Permissions	A tool holds broader rights to downstream systems than necessary — e.g. a shared admin service-account token.	Run tools in the user’s security context with minimal, short-lived scope.
Excessive Autonomy	High-impact or irreversible actions run with no human verification.	Gate sends, deletes, and transfers behind human-in-the-loop approval.

Excessive Agency is not a kind of prompt injection — it’s a distinct vulnerability that can fire from plain ambiguity or a model mistake. But it is the amplifier: when injection does land, over-permissioned and over-autonomous tools are what turn a hijacked model into an actual breach. The two are complementary, and you defend against both with the same medicine — least privilege.

You can’t patch the model into safety

It’s tempting to believe a newer, more aligned model has basically solved this, or that a good enough detector will catch the bad inputs. The evidence says otherwise. The October 2025 cross-lab paper The Attacker Moves Second — authored by a team spanning OpenAI, Anthropic, and Google DeepMind — took 12 recently published defenses that had each reported near-zero attack success, and bypassed them with adaptive attacks (gradient descent, reinforcement learning, random search, human red-teaming) at success rates above 90% for most of them. The throughline of the adversarial ML literature holds: the attacker moves second, and adapts to whatever you built.

So filtering and prompt-hardening are defense in depth — useful layers, never guarantees. The conclusion the whole field reached is the design stance: don’t try to filter injection away; engineer the system so a successful injection can’t reach a consequential action.

Designing for it: least privilege end-to-end

This is where you actually win. The mental model: remove a leg of the trifecta, and shrink the blast radius of everything that survives.

Scoped, short-lived credentials; per-user context. Never a shared admin token. A compromise should be contained to one user’s minimal scope, and expire quickly.
Allowlisted, least-functionality tools. Expose only what the task needs. Avoid raw shell / arbitrary HTTP when a narrow tool will do.
Treat all tool output as untrusted data, not instructions. Anthropic’s guidance is concrete: put third-party content only inside tool-result blocks (never in the system prompt or plain user text), tell the model what it is and where it came from, and JSON-encode it so an attacker can’t break out of the data into an instruction context. Don’t put your own instructions in tool results — they may be ignored or flagged.
Human-in-the-loop on high-impact / irreversible actions. Gate the right actions — sends, deletes, transfers, code execution — not every step (blanket approval just trains people to rubber-stamp). This is the LangGraph human-in-the-loop pattern, applied as a security control.
Sever the exfiltration channel. Disable auto-loaded remote images and links; restrict which domains the agent may contact. This is often the cheapest single mitigation — it breaks the trifecta even if the injection succeeds.
Enforce authorization downstream. The database, the payment API, the file store should check permissions themselves — complete mediation — rather than trusting the model to restrain itself.

Several of these belong in MAF middleware as reusable guardrails, and the policy reasoning behind them is the subject of the blog post why agents need permissions.

Two patterns worth knowing by name

When you genuinely need to ingest untrusted content and hold private data and act, the architecture has to carry the security instead of the model:

Dual-LLM / quarantine (Willison, 2023): split into a privileged LLM that plans and calls tools but never sees raw untrusted content, and a quarantined LLM that ingests the untrusted content but has no tool access and returns only structured, symbolic results (labels, booleans, opaque references). Injected instructions in the data physically cannot reach the component that can act. Google DeepMind’s CaMeL (“Defeating Prompt Injections by Design”) generalizes this with a capability/provenance-tracking interpreter that refuses to pass an untrusted-derived value to an exfiltration tool — a promising research direction, though it reduces rather than eliminates risk.
The Agents Rule of Two (Meta AI, late 2025): the trifecta restated as an operating rule. Within a single agent session, satisfy at most two of: processes untrusted input, accesses sensitive data, changes state or communicates externally. Need all three? Run it under supervision — human-in-the-loop approval. It’s explicitly inspired by the lethal trifecta, and it’s the most widely cited practical rule of thumb.

Notice none of these makes the model better at resisting injection. They all make the system incapable of the bad outcome. That is the entire shift: you remove a leg by design.

This pairs directly with two neighbours in the curriculum. Memory is another untrusted-content surface — see the blog post on agent memory poisoning, where the injected instruction is something the agent wrote to its own memory and reads back later. And the tool layer itself is where most of these controls live; revisit tools for the permission and schema mechanics this lesson assumes.

In one breath

An LLM reads its whole context as one undifferentiated token stream and can’t reliably tell instructions from data — so prompt injection has no model-layer fix.
Indirect injection (hidden orders inside a fetched page, email, or tool result) is the dominant agent danger; direct injection/jailbreak is the user attacking their own guardrails.
The lethal trifecta — private data + untrusted content + external communication — is when exfiltration becomes possible; any two are survivable, all three is the exploit.
OWASP LLM06 Excessive Agency (too much functionality / permission / autonomy) is the blast-radius amplifier — distinct from injection, cured by the same least privilege.
You can’t filter or prompt-harden injection away (adaptive attacks bypass defenses

90%); engineer the system so a successful injection can’t reach a consequential action — scoped creds, allowlisted tools, severed egress, human-in-the-loop on irreversible actions, dual-LLM/quarantine, the Agents Rule of Two.

Quick check

0/3

Q1An agent reads a customer's support email (untrusted), can query the customer database (private data), and has a send_email tool (external). Which single change most reliably prevents data exfiltration via prompt injection?

Q2Why is OWASP's LLM06 Excessive Agency a *distinct* problem from prompt injection (LLM01), not the same thing?

Q3TRANSFER: You're designing a research agent that must browse arbitrary public web pages (untrusted) AND summarize them into a private internal wiki it can write to (state change). Applying the Agents Rule of Two, what's the cleanest safe design?

You now have the agent-security mental model: injection is unsolved at the model layer, the lethal trifecta tells you when exfiltration is possible, Excessive Agency sets the blast radius, and least privilege is how you remove a leg by design. The human-in-the-loop lesson is where you wire the approval gate that backstops the irreversible actions this lesson told you to fear.

Agent Security — Prompt Injection & Least Privilege

What you'll learn

Before you start

Toggle the agent's capabilities. Watch when it becomes exploitable.

Two flavours of injection

The lethal trifecta

Excessive Agency — OWASP’s name for the blast radius

You can’t patch the model into safety

Designing for it: least privilege end-to-end

Two patterns worth knowing by name

In one breath

Quick check

Quick check

Next

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further