datarekha
Agents May 24, 2026

Tool selection at 1000 tools: routing techniques that ship

Cramming hundreds of MCP tools into your system prompt destroys both latency and accuracy. Vector retrieval, hierarchical menus, RAG-on-tools, and code mode each take a different bet. Here are the numbers, the production deployments, and the pattern that's winning.

13 min read · by datarekha · tools, routing, mcp, production

The Model Context Protocol shipped in November 2024, and within a year every SaaS company you’ve heard of had an MCP server. Zapier’s MCP endpoint alone exposes over 30,000 actions. Composio manages 500+ integrations. Block’s Goose, Anthropic’s Claude Desktop, Continue, and Cursor all ship with the assumption that you’ll wire up half a dozen MCP servers without thinking about it.

This is wonderful. It is also a structural disaster waiting to happen, because the original MCP design assumed the model would see every tool in the system prompt. At 30 tools that’s fine. At 300 it is genuinely bad. At 3,000 it is impossible.

This post is about the production patterns that have emerged to solve the “too many tools” problem. There are roughly four of them; they trade off in different directions; and one is quietly winning.

The shape of the problem

Tool descriptions are not small. A typical MCP tool with a JSON schema describing two or three parameters and a useful description runs 100-300 tokens. Multiply by N tools and you get the system-prompt overhead:

[Figure: system prompt tokens and accuracy vs. tool count. X-axis: tools (10, 100, 500, 1000+). Left axis: prompt tokens (0 to 100K). Right axis: tool-selection accuracy (inverted).]
Schematic of the two curves agent designers care about. Prompt tokens grow linearly in N. Tool-selection accuracy degrades sharply once the catalog passes ~100 tools, with documented drops of 7-85% in long-context function-calling benchmarks.
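The linear half of that picture is easy to reproduce on the back of an envelope. A minimal sketch, assuming the 100-300 tokens-per-tool range quoted above and ignoring everything else in the prompt; the function name is just for illustration:

```python
def tool_prompt_tokens(num_tools: int, tokens_per_tool: int) -> int:
    """Tokens spent on tool definitions alone when every tool sits in the prompt."""
    return num_tools * tokens_per_tool


# Low and high estimates for the catalog sizes discussed in this post.
for n in (30, 300, 3000):
    low = tool_prompt_tokens(n, 100)
    high = tool_prompt_tokens(n, 300)
    print(f"{n:>5} tools: {low:,} to {high:,} tokens")
```

Under those assumptions, 300 tools already cost 30,000 to 90,000 tokens before the user has typed a word.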

The numbers from the research literature are not gentle. A Salesforce-led benchmark in 2025 measured function-calling accuracy as the tool catalog grew from 8,000 tokens to 120,000 tokens — performance dropped between 7.59% and 85.58% depending on model and task. Even relatively small context growth costs: researchers documented some models losing 16 percentage points of accuracy with just an extra 1,000 tokens, and up to 50 percentage points once the prompt passed 8,000 tokens.

Anthropic’s Tool Search feature, shipped in 2025, encodes the same conclusion: at scale, you cannot put all tools in the prompt. The model has to find the tools first.

The four patterns

There are roughly four production approaches. They are not mutually exclusive — most real systems combine them — but it helps to name them separately.

Pattern A: Vector retrieval over tool descriptions

The most-deployed pattern. Embed every tool’s description into a vector store at registration time. When the user query arrives, embed it, retrieve the top-K most relevant tools, and inject only those into the agent’s context.
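To make the register-then-retrieve flow concrete, here is a minimal sketch. It assumes a toy bag-of-words vector in place of a real embedding model, and the tool names and the `ToolIndex` class are hypothetical; a production system would swap `embed` for a real model and the list for a vector store.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolIndex:
    """Embeds tool descriptions at registration time, retrieves top-K at query time."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def register(self, name: str, description: str) -> None:
        self._entries.append((name, embed(description)))

    def top_k(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [name for name, _ in ranked[:k]]


index = ToolIndex()
index.register("github_list_issues", "List GitHub issues in a repository")
index.register("slack_post_message", "Post a message to a Slack channel")
index.register("calendar_create_event", "Create a calendar event with attendees")

# Only the selected tools would be injected into the agent's context.
selected = index.top_k("show me the open issues in my GitHub repository", k=1)
```

Here `selected` contains only the GitHub tool; the other two definitions never reach the prompt.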

The numbers from the research are good:

  • A vector-based semantic tool discovery paper reports a 99.6% reduction in tool-related token consumption and a 97.1% hit-rate at K=3, with sub-100ms retrieval latency.
  • Performance degrades roughly linearly with catalog size, holding up well to about 1,000 tools — beyond that, flat similarity search loses accuracy and you need a two-stage hierarchy.
  • Context-enriched embeddings (where the embedding includes example invocations, not just the description) improve retrieval accuracy by ~50% over naive descriptions.

The trap is recall on multi-hop tasks. If the user’s request requires two tools that aren’t semantically similar to the request or to each other, top-K retrieval will miss one. The classic example: “send a summary of yesterday’s GitHub issues to my Slack channel” needs both a GitHub tool and a Slack tool; if your query embedding lights up the GitHub side, the Slack tool may not make it into top-K.

The fix in production is intent expansion: a small LLM call first generates 3-5 candidate intents from the query, and each intent does its own retrieval. Composio’s “just-in-time tool loading” is structurally this — query → intent extraction → focused tool fetch.
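A rough sketch of that shape, using the GitHub-plus-Slack request from above. Everything here is an assumption for illustration: the retriever is a toy word-overlap ranker, and `expand_intents` returns hard-coded intents where a real system would make the small LLM call. It is not Composio's implementation.

```python
import re

TOOLS = {
    "github_list_issues": "List GitHub issues in a repository",
    "slack_post_message": "Post a message to a Slack channel",
    "calendar_create_event": "Create a calendar event with attendees",
}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retriever: rank tools by word overlap with the query."""
    q = words(query)
    ranked = sorted(TOOLS, key=lambda name: len(q & words(TOOLS[name])), reverse=True)
    return ranked[:k]


def expand_intents(query: str) -> list[str]:
    """Hypothetical stand-in for the small LLM call that splits a request into intents."""
    return ["list GitHub issues from yesterday", "post a message to a Slack channel"]


def retrieve_with_expansion(query: str, k_per_intent: int = 1) -> list[str]:
    """Run one retrieval per intent and merge the results, preserving order."""
    selected: list[str] = []
    for intent in expand_intents(query):
        for name in retrieve(intent, k_per_intent):
            if name not in selected:
                selected.append(name)
    return selected


query = "send a summary of yesterday's GitHub issues to my Slack channel"
single_pass = retrieve(query, k=1)         # one tool only; the other is missed
expanded = retrieve_with_expansion(query)  # one tool per intent
```

With a budget of one tool per retrieval, the single pass can only surface one side of the task, while the expanded version returns both the GitHub and the Slack tool.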

Pattern B: Hierarchical menus

Instead of retrieving individual tools, organise tools into a tree: servers at the top, capability groups in the middle, individual tools at the leaves. The agent first picks a server (“I need GitHub stuff”), then a capability (“I need to read issues”), then the specific tool.

This is the explicit shape of Zapier’s MCP design. Their endpoint exposes a small number of meta-tools — find_action, run_action — and the agent navigates the catalog through those rather than seeing the 30,000 actions up front.

The tradeoff: extra round trips. Every hierarchical lookup costs another LLM turn, which adds latency. For deeply nested catalogs the total latency to find a tool can rival the latency of using it.
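The round-trip cost is easy to see in a sketch. The catalog, the tool names, and the `MenuNavigator` class below are all hypothetical, and this is not Zapier's API; the point is only that each level of the tree is a separate lookup, and each lookup would be one more LLM turn.

```python
# Hypothetical three-level catalog: server -> capability group -> tools.
CATALOG = {
    "github": {
        "issues": ["github_list_issues", "github_create_issue"],
        "pull_requests": ["github_list_prs", "github_merge_pr"],
    },
    "slack": {
        "messaging": ["slack_post_message", "slack_list_channels"],
    },
}


class MenuNavigator:
    """Exposes one level of the tree per call and counts the lookups made."""

    def __init__(self, catalog: dict[str, dict[str, list[str]]]) -> None:
        self.catalog = catalog
        self.lookups = 0  # each lookup corresponds to one extra LLM turn

    def list_servers(self) -> list[str]:
        self.lookups += 1
        return sorted(self.catalog)

    def list_capabilities(self, server: str) -> list[str]:
        self.lookups += 1
        return sorted(self.catalog[server])

    def list_tools(self, server: str, capability: str) -> list[str]:
        self.lookups += 1
        return list(self.catalog[server][capability])


nav = MenuNavigator(CATALOG)
servers = nav.list_servers()                 # "I need GitHub stuff"
groups = nav.list_capabilities("github")     # "I need to read issues"
tools = nav.list_tools("github", "issues")   # pick the specific tool
```

The agent never sees more than one menu at a time, but finding a leaf in this three-level tree takes three lookups before any tool has actually run.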

Pattern C: RAG-on-tools (with re-ranking)

Pattern A with a heavier retriever — typically: embedding-based recall of top-50, then an LLM-based or cross-encoder re-ranker to top-5. This is what Anthropic’s Tool Search does internally, and what most serious enterprise deployments evolve toward.

The benefit is materially better precision at K=3 — particularly for tasks where the right tool is semantically subtle. The cost is the re-ranker call itself, typically tens of milliseconds with a cross-encoder or a few hundred with a small LLM.

A reasonable production stack circa 2026: bge-m3 or a similar multilingual embedding for recall, a small re-ranker like BAAI/bge-reranker-v2-m3 for precision, with the entire tool catalog re-indexed nightly. Total latency for the full retrieval+rerank pipeline lands at ~80-150ms on commodity infrastructure.
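The two-stage shape is independent of which models you plug in. A minimal sketch, assuming toy scorers: word overlap stands in for embedding recall and Jaccard similarity stands in for the cross-encoder. The catalog and function names are hypothetical; in production each scorer would wrap a real model.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, tool description) -> relevance


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, description: str) -> float:
    """Cheap stage-1 stand-in for embedding similarity: shared word count."""
    return float(len(words(query) & words(description)))


def jaccard_score(query: str, description: str) -> float:
    """Stage-2 stand-in for a re-ranker: penalises descriptions that match loosely."""
    q, d = words(query), words(description)
    return len(q & d) / len(q | d) if (q | d) else 0.0


def two_stage_select(
    query: str,
    catalog: dict[str, str],
    recall_n: int = 50,
    final_k: int = 5,
    recall_scorer: Scorer = overlap_score,
    rerank_scorer: Scorer = jaccard_score,
) -> list[str]:
    """Broad recall to recall_n candidates, then precise re-ranking down to final_k."""
    shortlist = sorted(
        catalog, key=lambda n: recall_scorer(query, catalog[n]), reverse=True
    )[:recall_n]
    reranked = sorted(
        shortlist, key=lambda n: rerank_scorer(query, catalog[n]), reverse=True
    )
    return reranked[:final_k]


catalog = {
    "github_search_everything": "Search GitHub issues, pull requests, code, commits, and discussions by keyword",
    "github_search_issues": "Search GitHub issues",
    "jira_search_issues": "Search Jira issues",
    "slack_post_message": "Post a message to a Slack channel",
}

best = two_stage_select("search GitHub issues", catalog, recall_n=3, final_k=1)
```

Stage one cannot separate the two GitHub search tools, since both share three words with the query; the second scorer breaks the tie in favour of the narrower tool.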

Pattern D: Code mode

The bet that’s quietly winning. Instead of presenting tools as JSON schemas the model picks from, you present them as a typed API — a TypeScript module the model can import and call — and let the model write code that orchestrates them in a sandbox.

Cloudflare’s Code Mode published the benchmark numbers in 2025 that turned heads:

  • For simple single-event tasks, Code Mode used 32% fewer tokens than direct tool calling.
  • For complex 31-event tasks, Code Mode used 81% fewer tokens.
  • For 2,500+ API endpoints, code mode reduced the token footprint from 1.17 million tokens to roughly 1,000 — a 99.9% reduction.

The mechanism is subtle. Code mode doesn’t load all 2,500 endpoints into the prompt. It exposes a search() function the model uses to query the OpenAPI spec by capability area, and an execute() function that runs the resulting code in an isolated sandbox. The model writes a few lines of TypeScript that call the discovered endpoints; output of one call feeds the next without round-tripping through the LLM.

That last property is the killer feature. In conventional tool calling, every intermediate result has to travel back through the model to inform the next step. In code mode, intermediate results stay in sandbox memory; the model only sees the final result. For chained operations this is a step-function improvement.

[Figure: conventional tool calling vs. code mode. Conventional: LLM call 1 → tool A → LLM call 2 → tool B → LLM call 3 → tool C (3 LLM calls, 3 round trips, 3 intermediate results in context). Code mode: LLM call 1 writes code (const a = await toolA(); const b = await toolB(a);) → sandboxed runtime runs toolA() → toolB() → toolC() with intermediate values kept in memory → LLM call 2 synthesizes from the final result.]
Conventional tool calling pays for an LLM round-trip per intermediate result. Code mode collapses the inner orchestration into a single sandboxed run, with only the final value returning to the model.
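The search-then-execute loop can be sketched in a few lines. This is an illustration of the shape only, written in Python rather than the TypeScript that Cloudflare's Code Mode has the model write. The endpoints, the `ENDPOINTS` registry, and the canned data are hypothetical, and `exec` with stripped builtins is emphatically not a sandbox; a real deployment needs an isolated runtime.

```python
from typing import Any, Callable


def list_issues(repo: str) -> list[dict]:
    """Hypothetical GitHub endpoint returning canned data."""
    return [{"id": 1, "title": "Crash on login"}, {"id": 2, "title": "Typo in docs"}]


def summarize(issues: list[dict]) -> str:
    """Hypothetical helper that condenses issues into one line."""
    return f"{len(issues)} issues: " + "; ".join(issue["title"] for issue in issues)


def post_message(channel: str, text: str) -> dict:
    """Hypothetical Slack endpoint that echoes what it would have sent."""
    return {"channel": channel, "text": text, "ok": True}


ENDPOINTS: dict[str, tuple[Callable[..., Any], str]] = {
    "list_issues": (list_issues, "List issues in a GitHub repository"),
    "summarize": (summarize, "Summarize a list of issues as text"),
    "post_message": (post_message, "Post a message to a Slack channel"),
}


def search(capability: str) -> list[str]:
    """Return the names of endpoints whose description mentions the capability."""
    needle = capability.lower()
    return [name for name, (_, desc) in ENDPOINTS.items() if needle in desc.lower()]


def execute(code: str) -> Any:
    """Run model-written code with only the registered endpoints in scope.

    NOTE: exec() is NOT a security boundary; this only illustrates the data flow.
    """
    scope: dict[str, Any] = {name: fn for name, (fn, _) in ENDPOINTS.items()}
    scope["__builtins__"] = {}
    exec(code, scope)
    return scope.get("result")


# What the model might write after calling search("issues") and search("slack").
model_written_code = '''
issues = list_issues("acme/web")
result = post_message("#eng", summarize(issues))
'''

final = execute(model_written_code)
```

The issue list and the summary never leave `execute`; only `final` would travel back to the model.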

How the production deployments actually combine these

Almost nobody runs a single pattern in isolation. The shipping architectures combine retrieval and code-mode:

  • Anthropic Claude with Tool Search. Pattern C — vector recall plus re-ranking — exposes a small number of tools per turn from a large catalog. Configurable per workspace.
  • Cloudflare Agents SDK. Code mode out of the box for MCP tool calls. The model writes JavaScript; the runtime executes it; results return.
  • Composio. Pattern A (vector retrieval) with managed authentication and a unified tool catalog across 500+ integrations. Intent extraction is built in; you can ask for “tools for sending notifications” and get the relevant Slack/Discord/email tools without invoking them.
  • Zapier MCP. Pattern B (hierarchical menus) with meta-tools find_action and run_action. The 30,000-action catalog is searchable, not pre-loaded.
  • Block’s Goose. Pattern A with an explicit extension manifest. Each extension is a small set of related tools loaded together.

The architectures differ in what they assume about the deployment substrate. Cloudflare’s code mode assumes you have a sandbox (their Workers); Anthropic’s Tool Search assumes you trust Anthropic to host the search; Composio assumes you’re fine with their hosted gateway; Zapier assumes you’re a Zapier customer already.

The tradeoff matrix

| Pattern | Best for | Token cost | Latency overhead | Worst weakness |
|---|---|---|---|---|
| A: Vector retrieval | Most workloads under 1K tools | Lowest | ~80-100ms | Multi-tool tasks where required tools aren’t semantically similar |
| B: Hierarchical menus | Very large catalogs (10K+) | Low | One extra turn per lookup | Latency on deeply nested catalogs |
| C: RAG + re-ranking | Heterogeneous catalogs with subtle distinctions | Low | ~100-150ms | More infrastructure to maintain |
| D: Code mode | Chained multi-tool tasks | Lowest, especially at scale | Sandbox setup | Requires runtime infrastructure; model must reason in code |

Pattern A is the default. Pattern C is the upgrade. Pattern B is the specialised choice for genuinely massive catalogs. Pattern D is the overall winner once your task complexity justifies the runtime investment, and the complexity threshold at which it pays off is dropping fast.

A note on tool descriptions

Whichever pattern you pick, the quality of tool descriptions turns out to dominate the quality of retrieval. This is the most under-discussed operational lever in the whole space. A well-described tool with a clear verb, two example invocations, and explicit failure modes will be selected correctly far more often than a tool whose description is “interacts with the Foo API.”

Two failure patterns we see in audits:

  • The verb-overlap problem. Twelve tools whose descriptions all start with “Search…” or “Get…”. Retrieval surfaces all of them; the model picks at random; users see flaky behaviour. The fix is prepending the target system to every description (“Search GitHub issues…” not “Search issues…”) so the embedding space separates them.
  • The orphaned-parameters problem. A tool’s description is rich but its parameter schema lists input: str with no examples. The model picks the tool correctly, then calls it with the wrong shape and gets a server error. Production rule: every parameter needs a description, every required parameter needs an example, and the description should mention the unit/format/allowed values.
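The production rule above is mechanical enough to lint for. A minimal sketch, assuming tool definitions are dicts with a JSON Schema under `inputSchema` (as in MCP) and that examples live in the JSON Schema `examples` keyword; the `lint_tool` function and the sample tool are hypothetical.

```python
def lint_tool(tool: dict) -> list[str]:
    """Flag parameters that break the rule: every parameter described,
    every required parameter given at least one example."""
    problems: list[str] = []
    schema = tool.get("inputSchema", {})
    required = set(schema.get("required", []))
    for name, spec in schema.get("properties", {}).items():
        if not spec.get("description"):
            problems.append(f"parameter '{name}' has no description")
        if name in required and not spec.get("examples"):
            problems.append(f"required parameter '{name}' has no example")
    return problems


# A rich description sitting on top of an orphaned parameter.
orphaned = {
    "name": "foo_search",
    "description": "Search Foo records by free-text query and return the top matches.",
    "inputSchema": {
        "type": "object",
        "properties": {"input": {"type": "string"}},
        "required": ["input"],
    },
}

# The same tool after an afternoon of rewriting.
fixed = {
    "name": "foo_search",
    "description": "Search Foo records by free-text query and return the top matches.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "input": {
                "type": "string",
                "description": "Free-text query, at most 200 characters.",
                "examples": ["overdue invoices"],
            }
        },
        "required": ["input"],
    },
}

issues_before = lint_tool(orphaned)
issues_after = lint_tool(fixed)
```

Running a check like this at registration time catches the orphaned-parameters problem before the model ever sees the tool.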

The empirical impact is significant. A recent piece of research on rewriting tool descriptions reports that LLM-rewritten descriptions improve tool selection accuracy by double-digit percentages on harder benchmarks. The cheapest improvement to your agent’s tool use is usually not changing the retrieval architecture — it’s spending an afternoon rewriting your tool descriptions.

The contrarian opinion: code mode is winning

I started this post intending to give a balanced overview of four patterns, and I’m ending it convinced that code mode displaces the others within twelve months for any non-trivial agent.

Three reasons.

First, the token math is structurally better, not incrementally. An 81% reduction on complex tasks is not a tuning improvement; it’s a different cost regime. The cost gap between “model writes code” and “model picks from JSON schemas” widens with task length rather than narrowing. The systems most punished by current tool calling — the ones chaining 10+ tool calls — are exactly the ones the field wants to build more of.

Second, the model alignment story is great. LLMs have orders of magnitude more TypeScript and Python in their training data than they do tool-calling JSON. Asking the model to write code is asking it to do what it has practised on most. The Cloudflare blog notes that even mid-tier models become dramatically more capable when given code-mode access; the gap between Sonnet and Haiku narrows when the tooling is expressed as a familiar API.

Third, the runtime story is solved. Cloudflare Workers, Modal, Vercel sandboxes, and dozens of internal sandbox runtimes already exist for this exact use case. The “needs a sandbox” objection was real in 2024 and is largely solved infrastructure in 2026.

The honest counter-arguments: code mode is harder to debug (the sandbox run is a black box if not instrumented), harder to constrain (the model can write arbitrary code, not just predefined tool calls), and has a different failure mode (silent bugs in generated code, not just wrong tool selection). All of these are tractable. None of them outweigh a 5-10x cost reduction at scale.
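For reference, the conversion between "percent fewer tokens" and an "Nx" multiplier is simple arithmetic. A small sketch, assuming cost scales in proportion to tokens (which ignores sandbox and infrastructure costs); the function name is just for illustration:

```python
def reduction_to_factor(fraction_saved: float) -> float:
    """Convert a fractional token reduction into a cost-reduction multiplier."""
    return 1.0 / (1.0 - fraction_saved)


# The Cloudflare figures quoted earlier, plus the 90% needed to reach 10x.
for saved in (0.32, 0.81, 0.90):
    print(f"{saved:.0%} fewer tokens -> {reduction_to_factor(saved):.1f}x cheaper")
```

Under that assumption, the 81% figure for complex tasks works out to roughly 5.3x, and a 90% reduction is what 10x requires.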

Two lines worth tattooing:

  • If you have under ~30 tools, none of this matters; cram them in the prompt.
  • If you have over ~30 tools and you are not on code mode yet, you’re paying for hardware you don’t need.

The tool selection problem at scale used to feel like a fundamental architectural challenge. In retrospect it was a temporary one, caused by a tool-calling API design that pre-dated the realisation that LLMs write code better than they pick from menus. The field has corrected. The teams who corrected first are shipping.


Further reading: Anthropic’s tool use overview and Tool Search are the canonical references for pattern C. Cloudflare’s Code Mode blog is the manifesto for pattern D. Composio’s MCP gateway guide is the best practical writeup of the gateway pattern. For the underlying research on accuracy degradation, see the LongFuncEval paper.
