Classify Cheap, Execute Right: A Tool-Routing Layer for a Multi-Tool AI Agent

How I built a two-tier dispatch layer for an AI agent with 100+ tools: a cheap fast model classifies intent, a registry maps tools to services, and a monolithic fallback catches everything the fast path can't handle cleanly.

An AI agent that only does one thing is easy to build well. An agent that does everything (email, task management, CRM, financial monitoring, knowledge retrieval, infrastructure control) runs into a problem that has nothing to do with the model's intelligence: if every request loads every tool definition into one model call, that call gets slower and more expensive with every tool you add. I run an agent with well over a hundred tools spread across a dozen functional domains. This is the routing layer I built to make that tractable, and the general pattern underneath it.

The actual problem isn't the model

The instinct is to hand every tool to one capable model and let it figure out what to call. That works at a small scale and degrades badly at a large one — more tool definitions in context means slower responses and higher cost on every single request, even the ones that only need one simple tool. The model doing the reasoning shouldn't also be the thing deciding which of a hundred-plus tools is relevant to a two-sentence request. Those are different jobs with very different cost profiles, and treating them as one job is where the waste comes from.

Two-tier dispatch: classify cheap, execute right

The fix is to split classification from execution. A small, fast, cheap model looks at the incoming message and does exactly one job: decide which specific tool names are actually needed. It does no reasoning about the answer, and its output is capped at a couple of tool names per request. The router then uses that classification result to look up which backend service owns each tool.
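A minimal sketch of the classification step, under stated assumptions: `call_small_model` is a hypothetical stand-in for whatever cheap model API is in use, the comma-separated output format and the prompt wording are illustrative choices, and the cap of 2 is a guess at what "a couple" means here. The one idea it is meant to show is that the classifier's output is validated against known tool names and capped before anything else uses it.

```python
from typing import Callable, Iterable, List

MAX_TOOLS = 2  # assumed cap; the text only says "a couple"


def classify_tools(
    message: str,
    known_tools: Iterable[str],
    call_small_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    max_tools: int = MAX_TOOLS,
) -> List[str]:
    """Ask a small model for tool names only, then validate and cap the result."""
    known = set(known_tools)
    prompt = (
        "Return a comma-separated list of the tool names needed for this "
        "request. Return names only, no explanation.\n"
        f"Available tools: {', '.join(sorted(known))}\n"
        f"Request: {message}"
    )
    raw = call_small_model(prompt)

    picked: List[str] = []
    for name in raw.split(","):
        name = name.strip()
        # Drop names the model invented and any duplicates.
        if name in known and name not in picked:
            picked.append(name)
    return picked[:max_tools]
```

An empty result is a valid outcome of this function. It simply means the classifier found nothing it could name with confidence, which is one of the cases a fallback path would need to cover.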

The tool-to-service mapping isn't hardcoded into the router. It lives in a database table, and the router pulls it and caches it at the edge for a few minutes at a time. That distinction matters more than it sounds: adding a new domain of capability means inserting rows into a table, not redeploying the routing layer. The router doesn't need to know anything about what a "finance" or "CRM" domain even is — it just needs an up-to-date map from tool name to URL.
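As a rough illustration of the cached registry, here is a small sketch. The `load_rows` callable is a hypothetical placeholder for the real database query, the 300-second TTL is an assumed value for "a few minutes", and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from typing import Callable, Dict, Optional


class ToolRegistry:
    """Caches a tool-name -> service-URL map and reloads it after a TTL."""

    def __init__(
        self,
        load_rows: Callable[[], Dict[str, str]],  # hypothetical DB query
        ttl_seconds: float = 300.0,               # assumed "few minutes"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_rows = load_rows
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Copy the rows so later changes at the source are only
            # picked up on the next refresh.
            self._cache = dict(self._load_rows())
            self._loaded_at = now

    def url_for(self, tool_name: str) -> Optional[str]:
        """Return the owning service URL, or None if the tool is unknown."""
        self._refresh_if_stale()
        return self._cache.get(tool_name)

    def size(self) -> int:
        self._refresh_if_stale()
        return len(self._cache)
```

The trade-off in this shape is staleness: a newly inserted row is invisible until the TTL expires, so a new tool can take up to one TTL window to become routable.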

The fast path, and the fallback that makes it safe to ship

Once the classified tools have been looked up in the registry, there are two outcomes. If they cleanly resolve to exactly one backend service, the router dispatches directly to it: one hop, fast, cheap. If classification is ambiguous, spans multiple services, or the single-service call fails for any reason, the request falls back to a monolithic service that has every tool loaded and available. That fallback is slower and more expensive, and that's an accepted trade-off: it's the correctness backstop, not the common path.
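The decision described above can be sketched in a few lines. Everything named here is an assumption made for illustration: `call_service` stands in for the real HTTP call, the registry is a plain dict from tool name to URL, and the reason strings are invented labels. Treating an empty classification or a missing registry entry as "ambiguous" is also my reading, not something the design spells out.

```python
from typing import Callable, Dict, List, Optional, Tuple


def route(
    message: str,
    tools: List[str],
    registry: Dict[str, str],                # tool name -> service URL
    call_service: Callable[[str, str], str], # hypothetical: (url, message) -> response
    fallback_url: str,
) -> Tuple[str, str, Optional[str]]:
    """Return (response, url_that_handled_it, fallback_reason_or_None)."""
    reason: Optional[str] = None

    if not tools:
        reason = "no_tools_classified"
    elif any(t not in registry for t in tools):
        reason = "tool_missing_from_registry"
    else:
        services = {registry[t] for t in tools}
        if len(services) > 1:
            reason = "spans_multiple_services"
        else:
            url = next(iter(services))
            try:
                # Fast path: exactly one owning service, one hop.
                return call_service(url, message), url, None
            except Exception as exc:
                # Any failure on the fast path is retried on the monolith.
                reason = f"service_error: {exc}"

    # Slow path: the monolith has every tool loaded.
    return call_service(fallback_url, message), fallback_url, reason
```

Returning the fallback reason alongside the response keeps the routing decision inspectable by the caller without a second lookup.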

That hybrid is the actual design decision worth calling out: optimize the common case aggressively for speed and cost, and keep a universal, always-correct fallback for everything that doesn't fit the fast path cleanly. A pure microservice router with no fallback is fragile — a bad classification or a missing registry entry just fails the request. A pure monolith is safe but never gets faster or cheaper as you add capability. This sits between those two failure modes on purpose.

Context without full retrieval

Every request also needs relevant standing context — prior facts, preferences, decisions — without re-explaining everything from scratch each time. Rather than build a full retrieval-augmented pipeline with embeddings and a vector index, the system prompt is assembled per-request by pulling a bounded, recency-ordered slice of persisted memory directly into context. It's a simpler mechanism than semantic retrieval, and at the actual scale of the problem — a bounded, curated memory store rather than millions of unstructured documents — it's the right amount of engineering, not an under-built shortcut. Reaching for a vector database here would have been solving a scale problem I don't have.
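For a sense of how little machinery this takes, here is a sketch of per-request prompt assembly. The `MemoryItem` shape, the section heading text, and both limits (`max_items`, `max_chars`) are assumptions chosen for the example, not values from the real system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class MemoryItem:
    text: str
    updated_at: datetime


def build_system_prompt(
    base_prompt: str,
    memory: List[MemoryItem],
    max_items: int = 20,    # assumed bound on number of items
    max_chars: int = 4000,  # assumed bound on total size
) -> str:
    """Append a bounded, newest-first slice of memory to the base prompt."""
    newest_first = sorted(memory, key=lambda m: m.updated_at, reverse=True)

    lines: List[str] = []
    used = 0
    for item in newest_first[:max_items]:
        line = f"- {item.text}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the character budget
        lines.append(line)
        used += len(line)

    if not lines:
        return base_prompt
    return base_prompt + "\n\nKnown context (most recent first):\n" + "\n".join(lines)
```

Both bounds matter: the item cap keeps the slice predictable, and the character cap protects against a single oversized memory entry crowding out the rest of the context.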

You can't tune what you can't see

The router exposes its own decision-making on demand: which tools a message was classified into, which backend actually handled it, whether it fell back and why, and how large the current registry is. A routing layer that can't show its own reasoning is nearly impossible to debug once it's wrong in a way that isn't obvious from the final answer alone. A bad classification often looks identical to a fine classification hitting a broken worker, and you cannot tell those apart from the outside without the routing metadata.
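One possible shape for that routing metadata is a small record emitted per request. The field names below are hypothetical; they simply mirror the four things listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RoutingTrace:
    classified_tools: List[str]     # what the classifier picked
    handled_by: str                 # URL of the service that answered
    fell_back: bool                 # True if the monolith handled it
    fallback_reason: Optional[str]  # why, when fell_back is True
    registry_size: int              # entries in the cached registry

    def to_json(self) -> str:
        """Serialize for a debug endpoint or a structured log line."""
        return json.dumps(asdict(self), sort_keys=True)
```

With a record like this, the two look-alike failures separate cleanly: a bad classification shows up as wrong `classified_tools` with no fallback, while a broken worker shows up as correct `classified_tools` with `fell_back` set and a service error in `fallback_reason`.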

The lesson

The interesting engineering problem in a does-everything agent isn't the model call that produces the final answer — that part is comparatively easy. It's the traffic-control layer in front of it: deciding fast and cheaply what a request actually needs, routing it to the smallest thing that can handle it correctly, and falling back gracefully when that guess is wrong. Most of the actual engineering time on this system went into that layer, not into prompting the model that does the reasoning.

I'm Jesse Myers — Marine veteran, 32 years in enterprise IT, now building production AI systems. This site is where I write about what I've actually built, technically, in my own words.