AGENTS.md, Skills & Tool Calls
Three different ways to give an agent knowledge and power — always-on instructions, on-demand model-loaded skills, and runtime tool calls. What each one is, how they compose, and when to reach for each.
What you'll learn
- AGENTS.md = always-on INSTRUCTIONS — a vendor-neutral Markdown file loaded every session
- A Skill = on-demand KNOWLEDGE the model loads itself when a task matches its description
- Progressive disclosure: only a skill's name + description sits in context until it triggers
- A tool call = a runtime ACTION the model requests and your code (or an MCP server) executes
- How the three layers compose, and which one to reach for when
Before you start
You already know how an agent calls a tool — it emits a function name plus JSON arguments and your code runs it. This lesson is the layer above tools. It answers a question the tools lesson doesn’t: when you have a whole project’s worth of conventions, procedures, and actions, where does each piece live, and what does it cost you in context?
The answer is three distinct mechanisms that are easy to conflate because two of them are “just Markdown.” The one-line mental model to carry through the whole lesson: AGENTS.md is always-on instructions, a skill is on-demand knowledge, and a tool call is a runtime action.
1. AGENTS.md — always-on instructions
The official site (agents.md) calls it “a README for agents: a dedicated,
predictable place to provide the context and instructions to help AI coding
agents work on your project.” It is a plain Markdown file at the root of the
repository, and it complements your README.md rather than replacing it —
the human README is for humans; AGENTS.md holds the agent-facing detail
(build steps, tests, conventions) that would clutter it.
Two properties define it:
- It is vendor-neutral. The site frames it as “a simple, open format” and
says: “Rather than introducing another proprietary file, we chose a name and
format that could work for anyone.” It was introduced by OpenAI (it’s the
format Codex reads; the Codex docs say plainly, “Codex reads AGENTS.md files
before doing any work”), but it is now read by 20-plus agents, including
Cursor, Gemini CLI, GitHub Copilot, Jules, Devin, goose, and Windsurf, across
60,000+ open-source projects. In December 2025 it became a founding project of
the Linux Foundation’s Agentic AI Foundation, which is about as concrete as
“vendor-neutral” gets.
- There are no required fields. “AGENTS.md is just standard Markdown. Use any
headings you like.” Common sections are project overview, build/test commands,
code style, and testing instructions, but nothing is mandatory.
The proprietary cousins are the same idea under a different filename: Claude
Code’s CLAUDE.md, Cursor’s rules (.cursor/rules), Windsurf’s
.windsurfrules. They all do the job of persistent project instructions
loaded every session. (Claude Code’s docs even note it reads CLAUDE.md,
not AGENTS.md, and recommend bridging the two with an @AGENTS.md import
or a symlink so both tools read one source of truth — vivid proof that these
are the “same idea, different filename.”)
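To make “loaded every session” concrete, here is a minimal sketch of what a harness might do at startup. The function names, the search order, and the `# Project instructions` heading are all hypothetical; real agents each have their own lookup rules.

```python
from pathlib import Path

# Hypothetical search order: the open format first, then one proprietary cousin.
INSTRUCTION_FILES = ("AGENTS.md", "CLAUDE.md")


def load_project_instructions(repo_root: str) -> str:
    """Return the text of the first instruction file found at the repo root."""
    for filename in INSTRUCTION_FILES:
        candidate = Path(repo_root) / filename
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8")
    return ""  # No instruction file: the session starts without project context.


def build_system_prompt(base_prompt: str, repo_root: str) -> str:
    """Attach the always-on instructions to the system prompt for a session."""
    instructions = load_project_instructions(repo_root)
    if not instructions:
        return base_prompt
    return f"{base_prompt}\n\n# Project instructions\n{instructions}"
```

The point of the sketch is the cost model: whatever is in the file is paid for in context on every session, whether or not the task needs it.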
2. Skills — on-demand knowledge the model loads itself
A Skill is a folder containing a SKILL.md file that “packages
instructions, metadata, and optional resources (scripts, templates) that
[the agent] uses automatically when relevant.” Where AGENTS.md is always-on,
a skill is conditionally activated. Two things make a skill a skill:
YAML frontmatter with two required fields. Every SKILL.md begins
with name and description:
---
name: pdf-form-filler
description: Fill out PDF forms by mapping field names to values and flattening the result. Use when the user asks to complete, fill in, or populate a PDF form.
---
The constraints (from the official docs): name is at most 64 characters,
lowercase letters / numbers / hyphens only; description is non-empty and at
most 1024 characters. The single most important authoring rule is that the
description must say both what the skill does and when to use it — because
the description is the only thing the model sees at first, and it’s what the
model matches the task against to decide whether to trigger the skill.
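The constraints above are simple enough to check mechanically. The following is a small sketch of such a check; `validate_skill_metadata` is a hypothetical helper, not part of any official tooling, and it encodes only the limits quoted in this section.

```python
import re

NAME_PATTERN = re.compile(r"[a-z0-9-]+")
MAX_NAME_LENGTH = 64
MAX_DESCRIPTION_LENGTH = 1024


def validate_skill_metadata(name: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means the metadata passes."""
    problems = []
    if not name or len(name) > MAX_NAME_LENGTH:
        problems.append("name must be 1 to 64 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name may contain only lowercase letters, numbers, and hyphens")
    if not description.strip():
        problems.append("description must be non-empty")
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append("description must be at most 1024 characters")
    return problems
```

A check like this catches format errors only. Whether the description actually says what the skill does and when to use it still needs a human read.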
Skills are model-invoked. This is the crux of the distinction from
AGENTS.md. The model itself decides to load a skill. If it judges a skill
relevant to the current task, it reads the full SKILL.md into context. There
is no human flipping a switch and no “always on” — it’s a runtime decision
driven by the description. (Some products also let you trigger a skill by
name, but automatic, description-based triggering is the defining behaviour.)
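The mechanism that makes this cheap is progressive disclosure, which loads a skill in three levels: the name and description sit in context from the start, the full SKILL.md body is read only when the skill triggers, and bundled resources such as scripts and templates are touched only when the body calls for them.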
This is why “won’t installing 30 skills bloat my context window?” is a misconception: 30 skills cost ~30 small descriptions until one fires. The big body and any bundled scripts only enter context on demand — and a script’s source never enters context at all, only its output.
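A toy model of progressive disclosure may help. This sketch assumes a hypothetical in-memory `SkillRegistry`; real products read SKILL.md files from disk and let the model decide when to load one, but the context accounting works along the same lines.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str  # stands in for the rest of SKILL.md on disk
    loaded: bool = False


class SkillRegistry:
    """Hypothetical registry: metadata is always visible, bodies load on demand."""

    def __init__(self, skills):
        self._skills = {skill.name: skill for skill in skills}

    def startup_context(self) -> str:
        """Only names and descriptions are placed in context at startup."""
        return "\n".join(
            f"{skill.name}: {skill.description}" for skill in self._skills.values()
        )

    def load(self, name: str) -> str:
        """The body enters context only when the skill is triggered."""
        skill = self._skills[name]
        skill.loaded = True
        return skill.body
```

With 30 skills registered, `startup_context()` grows by 30 short lines, while each body stays out of context until `load()` is called for it.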
Skills work across Claude Code (~/.claude/skills/ for personal,
.claude/skills/ for project), the Claude API, and claude.ai; Anthropic ships
prebuilt ones (pptx, xlsx, docx, pdf) and open-sources more. The
SKILL.md folder convention is starting to spread beyond Anthropic (OpenAI’s
Codex documents an Agent Skills concept too), so treat it as an
Anthropic-originated pattern that is becoming a shared one, and check each
vendor’s docs for the exact cross-vendor details.
3. Tool calls — runtime actions
The bottom layer is the one the tools lesson covers in depth, so we’ll be brief and only nail the distinction. A tool call (function calling) is when the model, mid-generation, decides it needs a capability and emits a structured request — a tool name plus JSON arguments. Crucially, the model does not run the function. As OpenAI’s docs put it, a tool call is “a special kind of response we can get from the model if it… determines that… it needs to call one of the tools we made available to it.” Your application code (or, for a server-side tool, the provider’s infrastructure) executes it and feeds the result back; then the model continues. A tool is declared with a name, a description of when to use it, and a JSON-Schema for its arguments.
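The request-execute-return cycle can be sketched in a few lines. Everything here is illustrative: the `get_weather` tool, its stubbed result, and the shape of `model_request` are assumptions, and the exact field names differ between providers.

```python
import json

# A tool declaration: a name, a description of when to use it,
# and a JSON Schema for its arguments.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current temperature for a city. "
                   "Use when the user asks about weather.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def get_weather(city: str) -> dict:
    # Hypothetical stub; a real implementation would call a weather API.
    return {"city": city, "temperature_c": 21}


TOOL_IMPLEMENTATIONS = {"get_weather": get_weather}


def execute_tool_call(tool_call: dict) -> str:
    """Run the function the model requested and return the result as JSON text."""
    func = TOOL_IMPLEMENTATIONS[tool_call["name"]]
    arguments = json.loads(tool_call["arguments"])  # arguments arrive as JSON text
    return json.dumps(func(**arguments))


# A hand-written stand-in for what a model might emit mid-generation.
model_request = {"name": "get_weather", "arguments": '{"city": "Oslo"}'}
result_for_model = execute_tool_call(model_request)
```

Note where the work happens: the model only produces `model_request`, and `execute_tool_call` runs in your application before the result is fed back.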
MCP tools are the same primitive, just standardized
so any client can discover and invoke tools exposed by any server. The MCP spec
states verbatim that tools are “model-controlled, meaning that the language
model can discover and invoke tools automatically.” Clients discover them with
a tools/list request and invoke them with tools/call. MCP’s three
primitives actually sharpen the “who decides?” axis that this whole lesson
turns on:
| MCP primitive | Who controls it |
|---|---|
| Tools | model-controlled — the LLM chooses to call them |
| Resources | application-controlled — the client app supplies the data |
| Prompts | user-controlled — the user picks the template |
So an MCP tool surfaces to the model exactly like a native function call. “MCP is a different thing from tool calling” is a misconception: MCP just standardizes how external servers expose those calls.
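On the wire, MCP messages are JSON-RPC 2.0, so the two requests named above can be sketched as plain dictionaries. The `jsonrpc_request` helper and the `get_weather` tool are hypothetical, and transport, initialization, and response handling are left out.

```python
import itertools
import json
from typing import Optional

_request_ids = itertools.count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request with a fresh integer id."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discovery: ask the server which tools it exposes.
list_request = jsonrpc_request("tools/list")

# Invocation: call one tool by name with JSON arguments.
call_request = jsonrpc_request(
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "Oslo"}},
)
```

The `name` plus `arguments` payload of `tools/call` is the same pair a native function call carries, which is why an MCP tool looks like any other tool to the model.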
How they compose — and when to reach for each
These are layers, not competitors. They stack, with always-on instructions on top, skills in the middle, and tool calls at the bottom, and each can point down to the one below.
The composition is real and documented: Claude Code’s docs tell you to move a
multi-step procedure out of CLAUDE.md and into a skill rather than
bloating the always-on file; a skill body, in turn, commonly tells the agent
which tools or MCP servers to call and in what order. So the decision rule is
about frequency and kind, not preference:
| Reach for | When |
|---|---|
| AGENTS.md / CLAUDE.md | A fact that must hold in every session — build/test commands, project layout, conventions, “always do X.” Keep it short; the Claude Code docs suggest under ~200 lines because longer files reduce adherence. |
| A Skill | A repeatable, specialized procedure needed only sometimes — “how we fill our compliance PDF,” “our release checklist.” The docs say to move “task-specific instructions that don’t need to be in context all the time” here. |
| A Tool / MCP | The agent must do something or fetch live data — take an action, hit an API, query a DB, read/write files. |
A clean way to remember the failure modes, too: if your instruction file is 3,000 lines, you’ve probably stuffed skills’ worth of procedures into the always-on layer. If you find yourself pasting the same procedure into chat every few days, that’s a skill waiting to be written. And if the agent “knows” what to do but can’t actually make the change, you’re missing a tool.
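The decision rule can be compressed into a tiny function. This is a deliberately simplified sketch; `choose_layer` and its two boolean inputs are hypothetical, and real cases often need more than one layer at once.

```python
def choose_layer(needs_action: bool, needed_every_session: bool) -> str:
    """Pick a layer by kind first (action or knowledge), then by frequency."""
    if needs_action:
        return "tool"  # do something or fetch live data
    if needed_every_session:
        return "AGENTS.md"  # a fact that must hold in every session
    return "skill"  # a procedure needed only sometimes
```

For example, a release checklist gives `choose_layer(False, False)`, which returns `"skill"`, while querying a database gives `choose_layer(True, False)`, which returns `"tool"`.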
This taxonomy is the foundation for the rest of the agent-engineering lessons: the agent harness is what wires these layers together into a loop, and agent protocols are how separate agents expose capability to each other.
Next
You now have the taxonomy: always-on instructions, on-demand knowledge,
runtime actions. The next question is what runs the loop that loads AGENTS.md,
matches a task to a skill, and dispatches tool calls — that’s the
agent harness.