datarekha

AGENTS.md, Skills & Tool Calls

Three different ways to give an agent knowledge and power — always-on instructions, on-demand model-loaded skills, and runtime tool calls. What each one is, how they compose, and when to reach for each.

9 min read • Intermediate • Agentic AI • Lesson 5 of 29

What you'll learn

  • AGENTS.md = always-on INSTRUCTIONS — a vendor-neutral Markdown file loaded every session
  • A Skill = on-demand KNOWLEDGE the model loads itself when a task matches its description
  • Progressive disclosure: only a skill's name + description sits in context until it triggers
  • A tool call = a runtime ACTION the model requests and your code (or an MCP server) executes
  • How the three layers compose, and which one to reach for when

Before you start

You already know how an agent calls a tool — it emits a function name plus JSON arguments and your code runs it. This lesson is the layer above tools. It answers a question the tools lesson doesn’t: when you have a whole project’s worth of conventions, procedures, and actions, where does each piece live, and what does it cost you in context?

The answer is three distinct mechanisms that are easy to conflate because two of them are “just Markdown.” Here is the one-line mental model to carry through the whole lesson:

  • AGENTS.md (INSTRUCTIONS): always on, loaded every session. Holds conventions, build / test commands, and do / don’t rules. The standing rules.
  • Skill, SKILL.md (KNOWLEDGE): on demand, loaded by the model when relevant. A packaged procedure with a name + description and optional scripts. A manual off the shelf.
  • Tool call (ACTION): at runtime, requested by the model and run by code. Reads / writes a file, hits an API, queries a database. The hands that act.

Instructions you always carry • a manual you pull off the shelf when relevant • a button you press to make something happen.

1. AGENTS.md — always-on instructions

The official site (agents.md) calls it “a README for agents: a dedicated, predictable place to provide the context and instructions to help AI coding agents work on your project.” It is a plain Markdown file at the root of the repository, and it complements your README.md rather than replacing it — the human README is for humans; AGENTS.md holds the agent-facing detail (build steps, tests, conventions) that would clutter it.

Two properties define it:

  • It is vendor-neutral. The site frames it as “a simple, open format” and says: “Rather than introducing another proprietary file, we chose a name and format that could work for anyone.” It was introduced by OpenAI (it’s the format Codex reads — the Codex docs say plainly, “Codex reads AGENTS.md files before doing any work”), but it is now read by 20-plus agents — Cursor, Gemini CLI, GitHub Copilot, Jules, Devin, goose, Windsurf and more — across 60,000+ open-source projects. In December 2025 it became a founding project of the Linux Foundation’s Agentic AI Foundation, which is about as concrete as “vendor-neutral” gets.
  • There are no required fields. “AGENTS.md is just standard Markdown. Use any headings you like.” Common sections are project overview, build/test commands, code style, and testing instructions — but nothing is mandatory.
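
To make "always on" concrete, here is a minimal sketch of how a harness might fold AGENTS.md into every session. The function name build_system_prompt and the heading it adds are hypothetical, chosen for illustration; real agents each have their own loading rules.

```python
from pathlib import Path


def build_system_prompt(repo_root: Path, base_prompt: str) -> str:
    """Prepend the repo's AGENTS.md (if any) to the prompt for every session.

    Hypothetical helper: the name and the heading text are assumptions,
    not any particular agent's implementation.
    """
    agents_file = repo_root / "AGENTS.md"
    if not agents_file.is_file():
        # No instruction file: the session starts with the base prompt only.
        return base_prompt
    instructions = agents_file.read_text(encoding="utf-8")
    return f"{base_prompt}\n\n# Project instructions (AGENTS.md)\n{instructions}"
```

The point of the sketch is that nothing decides whether to load the file: if it exists, it is in context, which is why its length matters.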

The proprietary cousins are the same idea under a different filename: Claude Code’s CLAUDE.md, Cursor’s rules (.cursor/rules), Windsurf’s .windsurfrules. They all do the job of persistent project instructions loaded every session. (Claude Code’s docs even note it reads CLAUDE.md, not AGENTS.md, and recommend bridging the two with an @AGENTS.md import or a symlink so both tools read one source of truth — vivid proof that these are the “same idea, different filename.”)

2. Skills — on-demand knowledge the model loads itself

A Skill is a folder containing a SKILL.md file that “packages instructions, metadata, and optional resources (scripts, templates) that [the agent] uses automatically when relevant.” Where AGENTS.md is always-on, a skill is conditionally activated. Two things make a skill a skill:

Required YAML frontmatter with two required fields. Every SKILL.md begins with name and description:

---
name: pdf-form-filler
description: Fill out PDF forms by mapping field names to values and flattening the result. Use when the user asks to complete, fill in, or populate a PDF form.
---

The constraints (from the official docs): name is at most 64 characters, lowercase letters / numbers / hyphens only; description is non-empty and at most 1024 characters. The single most important authoring rule is that the description must say both what the skill does and when to use it — because the description is the only thing the model sees at first, and it’s what the model matches the task against to decide whether to trigger the skill.
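
The constraints above are easy to check mechanically. The following is a small sketch of such a check; validate_skill_metadata is a hypothetical helper, not part of any official tooling, and it encodes only the limits quoted in this lesson.

```python
import re

# Lowercase letters, numbers, and hyphens only (as quoted above).
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def validate_skill_metadata(name: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means the metadata passes."""
    problems = []
    if len(name) > 64:
        problems.append("name must be at most 64 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name may contain only lowercase letters, numbers, and hyphens")
    if not description.strip():
        problems.append("description must be non-empty")
    elif len(description) > 1024:
        problems.append("description must be at most 1024 characters")
    return problems
```

A check like this catches format errors only. Whether the description says both what the skill does and when to use it still needs a human read.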

Skills are model-invoked. This is the crux of the distinction from AGENTS.md. The model itself decides to load a skill. If it judges a skill relevant to the current task, it reads the full SKILL.md into context. There is no human flipping a switch and no “always on” — it’s a runtime decision driven by the description. (Some products also let you trigger a skill by name, but automatic, description-based triggering is the defining behaviour.)

The mechanism that makes this cheap is progressive disclosure — three levels of loading:

  • LEVEL 1 · METADATA (~100 tok): name + description only. Always loaded at startup; install many skills, pay almost nothing.
  • LEVEL 2 · INSTRUCTIONS (< 5k tok): the full SKILL.md body. Loaded ONLY when the skill triggers; the actual procedure.
  • LEVEL 3+ · RESOURCES (~unlimited): bundled files & scripts. Read / run only as needed; scripts run via bash, and only their OUTPUT enters context.

Only the ~100-token name + description sits in context until the model triggers the skill; then the body loads. (Token figures are the docs’ rough guides, not hard limits.)

This is why “won’t installing 30 skills bloat my context window?” is a misconception: 30 skills cost ~30 small descriptions until one fires. The big body and any bundled scripts only enter context on demand — and a script’s source never enters context at all, only its output.
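
Using the docs' rough ~100-token figure, 30 installed skills would come to about 30 × 100 = 3,000 tokens of metadata until one triggers. The sketch below illustrates the mechanism behind that number: metadata is held eagerly, the body is read only on trigger. Skill, SkillRegistry, and load_body are hypothetical names for illustration, not any product's internals.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    name: str
    description: str
    load_body: Callable[[], str]  # deferred read of the SKILL.md body
    body: Optional[str] = None    # stays None until the skill triggers


class SkillRegistry:
    """Hypothetical registry illustrating progressive disclosure."""

    def __init__(self, skills: list[Skill]) -> None:
        self._skills = {s.name: s for s in skills}

    def startup_context(self) -> str:
        # Level 1: only name + description enter context at startup.
        return "\n".join(f"{s.name}: {s.description}" for s in self._skills.values())

    def trigger(self, name: str) -> str:
        # Level 2: the body is read the first time the skill is triggered.
        skill = self._skills[name]
        if skill.body is None:
            skill.body = skill.load_body()
        return skill.body
```

In a real agent the model, not the harness, makes the call that corresponds to trigger, by matching the task against the description.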

Skills work across Claude Code (~/.claude/skills/ for personal, .claude/skills/ for project), the Claude API, and claude.ai; Anthropic ships prebuilt ones (pptx, xlsx, docx, pdf) and open-sources more. The SKILL.md folder convention is starting to spread beyond Anthropic (OpenAI’s Codex documents an Agent Skills concept too), so treat it as an Anthropic-originated pattern that is becoming a shared one, and verify the exact cross-vendor details against each vendor’s own docs.

3. Tool calls — runtime actions

The bottom layer is the one the tools lesson covers in depth, so we’ll be brief and only nail the distinction. A tool call (function calling) is when the model, mid-generation, decides it needs a capability and emits a structured request — a tool name plus JSON arguments. Crucially, the model does not run the function. As OpenAI’s docs put it, a tool call is “a special kind of response we can get from the model if it… determines that… it needs to call one of the tools we made available to it.” Your application code (or, for a server-side tool, the provider’s infrastructure) executes it and feeds the result back; then the model continues. A tool is declared with a name, a description of when to use it, and a JSON-Schema for its arguments.
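
The division of labour described above (the model requests, your code executes) can be sketched as follows. The tool add_numbers, the declaration layout, and the shape of the tool_call dict are all assumptions for illustration; each provider's API has its own exact field names.

```python
import json
from typing import Any, Callable

# Hypothetical tool declaration: a name, a description of when to use it,
# and a JSON Schema for the arguments.
ADD_NUMBERS_DECLARATION = {
    "name": "add_numbers",
    "description": "Add two numbers. Use when the user asks for a sum.",
    "parameters": {
        "type": "object",
        "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
        "required": ["a", "b"],
    },
}


def add_numbers(a: float, b: float) -> float:
    return a + b


# Application-side registry: the model never touches these functions directly.
TOOLS: dict[str, Callable[..., Any]] = {"add_numbers": add_numbers}


def execute_tool_call(tool_call: dict) -> str:
    """Run the requested tool and return a JSON string to feed back to the model.

    Assumes tool_call looks like {"name": str, "arguments": "<JSON string>"}.
    """
    name = tool_call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(tool_call["arguments"])
    return json.dumps({"result": TOOLS[name](**args)})
```

For example, execute_tool_call({"name": "add_numbers", "arguments": '{"a": 2, "b": 3}'}) runs the function in your process and produces the text that goes back into the conversation.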

MCP tools are the same primitive, just standardized so any client can discover and invoke tools exposed by any server. The MCP spec states verbatim that tools are “model-controlled, meaning that the language model can discover and invoke tools automatically.” Clients discover them with a tools/list request and invoke them with tools/call. MCP’s three primitives actually sharpen the “who decides?” axis that this whole lesson turns on:

MCP primitive | Who controls it
Tools | model-controlled: the LLM chooses to call them
Resources | application-controlled: the client app supplies the data
Prompts | user-controlled: the user picks the template

So an MCP tool surfaces to the model exactly like a native function call. “MCP is a different thing from tool calling” is a misconception: MCP just standardizes how external servers expose those calls.
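
MCP messages are JSON-RPC 2.0, so the tools/list and tools/call requests mentioned above can be sketched as plain dictionaries. The helper jsonrpc_request and the example argument ("path": "template.pdf") are hypothetical; this is a sketch of the message shape, not a working MCP client.

```python
import json
from typing import Optional


def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: ask the server which tools it exposes.
list_request = jsonrpc_request(1, "tools/list")

# Invocation: call one tool by name with JSON arguments.
# The tool name and its argument are illustrative only.
call_request = jsonrpc_request(
    2,
    "tools/call",
    {"name": "read_file", "arguments": {"path": "template.pdf"}},
)

print(json.dumps(call_request, indent=2))
```

The tools/call payload is still a tool name plus JSON arguments, the same pair a native function call carries.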

How they compose — and when to reach for each

These are layers, not competitors. They stack, and each can point down to the one below:

  • AGENTS.md · standing rules (always on). Can point to skills: “for X, use the X skill”
  • Skill · the just-in-time playbook (on demand). Can orchestrate tools: “call read_file, then write_file”
  • Tools / MCP · the executable actions (runtime). The bottom layer both of the above ultimately drive

AGENTS.md sets the standing rules, a Skill supplies the just-in-time playbook, and Tools are the hands that act.

The composition is real and documented: Claude Code’s docs tell you to move a multi-step procedure out of CLAUDE.md and into a skill rather than bloating the always-on file; a skill body, in turn, commonly tells the agent which tools or MCP servers to call and in what order. So the decision rule is about frequency and kind, not preference:

Reach for | When
AGENTS.md / CLAUDE.md | A fact that must hold in every session: build/test commands, project layout, conventions, “always do X.” Keep it short; the Claude Code docs suggest under ~200 lines because longer files reduce adherence.
A Skill | A repeatable, specialized procedure needed only sometimes: “how we fill our compliance PDF,” “our release checklist.” The docs say to move “task-specific instructions that don’t need to be in context all the time” here.
A Tool / MCP | The agent must do something or fetch live data: take an action, hit an API, query a DB, read/write files.

A clean way to remember the failure modes, too: if your instruction file is 3,000 lines, you’ve probably stuffed skills’ worth of procedures into the always-on layer. If you find yourself pasting the same procedure into chat every few days, that’s a skill waiting to be written. And if the agent “knows” what to do but can’t actually make the change, you’re missing a tool.
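
The decision rule (kind first, then frequency) is small enough to write down as a function. This is only a mnemonic sketch; reach_for and its two parameters are hypothetical, and real cases often involve more than one layer at once.

```python
def reach_for(must_act: bool, needed_every_session: bool) -> str:
    """Pick the layer for a new piece of agent capability (mnemonic sketch)."""
    if must_act:
        # Doing something or fetching live data needs an executable action.
        return "tool"
    if needed_every_session:
        # Standing facts and conventions belong in the always-on file.
        return "AGENTS.md"
    # Knowledge needed only sometimes is a packaged, on-demand procedure.
    return "skill"
```

For instance, a commit-message convention maps to reach_for(False, True), a release checklist to reach_for(False, False), and writing a file to reach_for(True, False).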

This taxonomy is the foundation for the rest of the agent-engineering lessons: the agent harness is what wires these layers together into a loop, and agent protocols are how separate agents expose capability to each other.

Quick check

Q1. An agent needs to know your team's commit-message convention on every single task. Which mechanism is the right home for it?
Q2. You install 25 skills. What is the context-window cost before any of them is triggered?
Q3. Transfer: a SKILL.md for 'fill our quarterly compliance PDF' triggers. Its body says to call read_file on the template, then write_file on the result. In this single request, which layer EXECUTES the file write, and what decided to load the skill?

Next

You now have the taxonomy: always-on instructions, on-demand knowledge, runtime actions. The next question is what runs the loop that loads AGENTS.md, matches a task to a skill, and dispatches tool calls — that’s the agent harness.

Practice this in an interview

How do function/tool calling and LLM agents work at a high level?

Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.

What prompt engineering techniques should every LLM practitioner know?

The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Combining these predictably closes the gap between a capable base model and a production-ready feature.

When should you use RAG vs fine-tuning vs a long-context model?

RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.

When and how should you trigger model retraining — scheduled vs. event-driven?

Scheduled retraining is simple and predictable but wastes compute when nothing has shifted and reacts slowly when drift is sudden. Event-driven retraining ties compute to evidence — a drift alarm, a performance threshold breach, or a data volume trigger — and is more efficient at scale. Most mature systems combine both.
