The Hidden Tax on AI Agent Swarms

You split a complex task into five agents. Each one has a focused job. The design looks clean on a whiteboard. Then you run it, and the API bill is three times what you expected, the wall clock is worse than the monolithic version you started with, and two of the agents have started arguing with each other’s outputs in ways that require a sixth agent to resolve.

This is not a corner case. It is what happens to most agentic systems once they get past the prototype stage. The reflex to decompose is strong, and it is not wrong in principle. Parallelism is real. Fault isolation is real. But decomposition carries a cost that teams consistently underestimate, and when that cost exceeds the benefit, you end up paying more for worse results.

The symptom is predictable: token spend scales roughly linearly with the number of agents you add, even when the actual work being done stays constant. You are not buying more capability. You are buying more overhead.

Why This Happens

The overhead is not random. It comes from three places, and all three compound each other.

Context coupling. Every agent you add needs to understand the work it is operating on. That context does not disappear when you split the task into pieces. It gets duplicated. If you are building a pipeline where agent one writes a plan, agent two executes a subtask against that plan, and agent three validates the output, all three agents are carrying some version of the same background knowledge. The tokens you saved by shrinking each individual task get spent again — and again — passing context into each new invocation. For a discussion of how this plays out at the task decomposition level, see What Is the Smallest Unit of Work in Agent Decomposition?.
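
As a concrete illustration, here is the duplication pattern in miniature. This is a sketch, not a real client: call_model is a hypothetical stand-in for whatever completion API you use. The point is that BACKGROUND rides along on every invocation.

```python
# call_model is a hypothetical stand-in for your provider's completion API.
def call_model(prompt: str) -> str:
    return f"<model response to {len(prompt)} chars of prompt>"

# The shared background every agent needs: project docs, conventions,
# constraints. Assume this is several thousand tokens in practice.
BACKGROUND = "...project background, constraints, conventions..."

def plan(task: str) -> str:
    # Agent one carries the full background.
    return call_model(f"{BACKGROUND}\n\nWrite a plan for: {task}")

def execute(plan_text: str) -> str:
    # Agent two carries the same background again, plus the plan.
    return call_model(f"{BACKGROUND}\n\nExecute step one of this plan:\n{plan_text}")

def validate(output: str) -> str:
    # Agent three carries the background a third time.
    return call_model(f"{BACKGROUND}\n\nValidate this output:\n{output}")

# Three invocations, three full copies of BACKGROUND on the bill.
result = validate(execute(plan("migrate the billing service")))
```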

Prompt overhead. Every agent call requires a prompt. That prompt has to establish what the agent is doing, what constraints apply, what format the output should take, and enough background to make the task tractable. For small subtasks, this scaffolding can easily consume more tokens than the work itself. You end up with agents whose context budget goes mostly to framing, not execution.
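
A back-of-envelope illustration of that ratio. The token counts below are made-up but plausible assumptions; substitute your own tokenizer's numbers.

```python
# Token counts are illustrative; substitute your tokenizer's measurements.
SYSTEM_PROMPT_TOKENS = 800     # role, constraints, output format rules
TOOL_DEFINITION_TOKENS = 1200  # tool/function specs
BACKGROUND_TOKENS = 3000       # shared context the agent needs
TASK_TOKENS = 150              # the actual subtask instruction

scaffold = SYSTEM_PROMPT_TOKENS + TOOL_DEFINITION_TOKENS + BACKGROUND_TOKENS
print(f"scaffold-to-work ratio: {scaffold / TASK_TOKENS:.0f}x")  # 33x
```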

State management between handoffs. When agents pass results to each other, something has to coordinate that handoff: validate the output, reformat it, handle the case where the upstream agent produced something unexpected. That coordination logic is not free. Every handoff is a potential failure mode, and the thinner your subtasks, the more handoffs you need. Systems that decompose aggressively tend to accumulate brittle seams between agents that consume disproportionate engineering time when something goes wrong.
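
A minimal sketch of the kind of seam logic every handoff accumulates. The JSON contract and the HandoffError type are illustrative assumptions; the shape of the problem is the same whatever your interchange format is.

```python
import json

class HandoffError(Exception):
    """Raised when an upstream agent's output breaks the handoff contract."""

def validate_handoff(raw_output: str, required_keys: set[str]) -> dict:
    # Every seam between agents needs something like this: parse the
    # upstream output, check its shape, and decide what happens when
    # the contract is broken.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"upstream output is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise HandoffError("upstream output is not a JSON object")
    missing = required_keys - payload.keys()
    if missing:
        raise HandoffError(f"upstream output missing keys: {missing}")
    return payload

# Each HandoffError typically triggers a retry, which is another full
# agent invocation (context and scaffold included) on the bill.
clean = validate_handoff('{"plan": "step one...", "risks": []}', {"plan", "risks"})
```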

These three costs do not add linearly. They interact. More handoffs means more context duplication means more prompt overhead, and the failure rate at each seam multiplies the revision loops that generate even more tokens. The model choice makes this worse in a specific way: if you are using a less capable model to save money per token, you often have to decompose further to keep each task within the model’s effective reasoning window. That decomposition drives up total cost even as the per-token rate goes down. A cheaper model can end up being the more expensive option overall.
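
To make that concrete with illustrative numbers (not benchmarks): a model at half the per-token rate that forces a decomposition fine enough to triple total tokens is a net loss.

```python
# Illustrative numbers, not benchmarks. A cheaper model at half the
# per-token rate that forces a 3x finer decomposition (with the context
# duplication that implies) still loses on total cost.
strong_rate, strong_tokens = 1.0, 100_000  # relative price, tokens per task
cheap_rate, cheap_tokens = 0.5, 300_000    # finer decomposition triples tokens

print(strong_rate * strong_tokens)  # 100000.0
print(cheap_rate * cheap_tokens)    # 150000.0, 1.5x the "expensive" option
```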

The Economic Reality

Most teams track API cost as a line item, but they do not instrument it at the task level. They see the bill go up as the system grows and assume that is the cost of more capability. Often it is the cost of more overhead.

A rough model: a naïve five-agent decomposition of a task that could be handled by a single well-prompted agent typically runs 3 to 5 times the token cost. The exact multiple depends on how much context each agent needs, how many revision loops the handoffs generate, and whether any of the agents carry substantial prompt scaffolding (system prompts, tool definitions, output schemas). In practice, the lower end of that range assumes unusually clean handoffs with well-matched schemas. The upper end, or above it, is common when the agents need to coordinate on ambiguous intermediate outputs.
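
One way to see where that range comes from is a back-of-envelope model. Every parameter below is an assumption chosen to make the shape of the curve visible, not a measurement:

```python
def monolith_tokens(shared_context=4000, scaffold=1500, total_work=5000):
    # One agent: carries the context and scaffold exactly once.
    return shared_context + scaffold + total_work

def swarm_tokens(n_agents=5, shared_context=4000, scaffold=1500,
                 total_work=5000, revision_rate=0.3):
    # Each agent re-carries the shared context and its own scaffold;
    # revision loops replay whole invocations at each handoff.
    per_call = shared_context + scaffold + total_work / n_agents
    handoffs = n_agents - 1
    return per_call * (n_agents + handoffs * revision_rate)

print(f"{swarm_tokens() / monolith_tokens():.1f}x")  # 3.8x under these assumptions
```

With these assumptions the five-agent version lands at roughly 3.8x, inside the range above. Raising the revision rate or the shared-context size pushes it past 5x.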

The wall-clock picture is often worse. Parallel execution helps when the subtasks are genuinely independent, but most real workflows have dependencies. If agent three cannot start until agent two finishes, and agent two is waiting on a validation loop from agent one, you are not running in parallel. You are running sequentially with extra latency at each boundary.
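
A toy calculation makes the gap visible. The durations are illustrative: seconds per agent call, plus a fixed handoff cost at each seam.

```python
# Critical-path latency dominates when subtasks have dependencies.
agent_latency = [8, 12, 6]  # plan, execute, validate (seconds, illustrative)
handoff_overhead = 2        # validation and reformatting at each seam

parallel_ideal = max(agent_latency)  # only if subtasks were truly independent
sequential_real = sum(agent_latency) + handoff_overhead * (len(agent_latency) - 1)
print(parallel_ideal, sequential_real)  # 12 vs 30
```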

The context window is the primary technical constraint here, as The Context Window is a Viewport, Not a Bucket discusses. Compute is cheap and scales well; tokens are not and do not. A system that duplicates context across many agents is not respecting that asymmetry.

The teams that discover this problem late are usually the ones who grew the agent count incrementally, with each step looking locally justified, until the aggregate behavior became inexplicable. Auditing it retrospectively is harder than designing for it up front.

What Good Looks Like

The right architecture for agentic systems amortizes context across the work rather than duplicating it into each agent invocation. The way this works in practice: the shared scope that every agent needs gets prepared once, as a pre-computed work product, before any agent runs. Agents consume from that shared scope rather than each reconstructing it independently.
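
A minimal sketch of the pattern, again assuming a hypothetical call_model stand-in. The distilled scope is built once and is much smaller than the raw background each agent would otherwise reconstruct; on providers that discount repeated prompt prefixes via caching, keeping it as a stable prefix compounds the saving.

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your provider's completion API.
    return f"<response to {len(prompt)} chars>"

def build_shared_scope(task: str) -> str:
    # Pre-computed work product: one pass, before any agent runs, that
    # distills everything every downstream agent will need.
    return call_model(f"Distill the constraints and context for: {task}")

def run_agent(role_prompt: str, shared_scope: str, subtask: str) -> str:
    # Agents consume the distilled scope instead of each independently
    # reconstructing the full background.
    return call_model(f"{shared_scope}\n\n{role_prompt}\n\nSubtask: {subtask}")

scope = build_shared_scope("migrate the billing service")
plan = run_agent("You write plans.", scope, "draft the migration plan")
step = run_agent("You execute plan steps.", scope, f"execute step one of: {plan}")
```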

This sounds straightforward. In implementation it requires discipline about what belongs in shared scope versus what is specific to a particular agent’s invocation. Get that boundary wrong in one direction and you are loading agents with irrelevant context. Get it wrong in the other direction and you are back to duplication.

The agent count itself is not the variable to optimize. The variable is context per agent-turn, which is a function of what each agent is being asked to carry that it could have gotten from shared scope instead. Reducing that redundancy is where the real cost savings come from.

Tiered model dispatch also matters. Not every agent in a system needs the most capable model available. Agents doing deterministic, well-specified subtasks can often run on lighter, faster, cheaper models without quality loss. The coordination and reasoning work, where model capability actually matters, can run on stronger models. Getting that split right requires knowing which subtasks are actually hard, which is a different question than which subtasks look complex on a whiteboard.
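
A sketch of what the dispatch layer can look like. The model names and the open_ended flag are placeholders; a real router would key off a measured signal rather than a hand-set boolean.

```python
MODEL_TIERS = {
    "light": "cheap-fast-model",     # deterministic, well-specified subtasks
    "strong": "capable-slow-model",  # coordination and open-ended reasoning
}

def pick_model(subtask: dict) -> str:
    # A real router would use a measured signal (historical failure rate
    # per subtask type, schema complexity), not a single flag.
    tier = "strong" if subtask.get("open_ended") else "light"
    return MODEL_TIERS[tier]

print(pick_model({"name": "reformat JSON output", "open_ended": False}))
print(pick_model({"name": "reconcile conflicting specs", "open_ended": True}))
```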

The outcome, when the architecture is right, is that total token spend no longer climbs in step with agent count, because the marginal cost of each additional agent is mostly execution rather than context reconstruction. Wall clock improves because revision loops shrink. The system behaves more predictably because the shared scope provides a consistent ground truth that agents are not independently reconstructing.

The Benchmark You Should Run Yourself

Before you conclude your current harness is efficient, measure four numbers (a minimal instrumentation sketch follows the list):

Tokens per task. Not per agent: per completed unit of work, summed across all agents involved.

Wall-clock per task. Total elapsed time from task submission to final output, including revision loops.

Revision-loop count. How many times does an agent’s output get rejected or need correction before the downstream agent can proceed?

Per-handoff context duplication factor. Pick two agents with a handoff between them. Measure how many tokens of shared background both are carrying. The share of duplicated context should be low; if more than half of what the two agents carry is the same background, you have an overhead problem.
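
Here is a minimal instrumentation sketch covering all four numbers. The record_call hook and the line-based duplication proxy are assumptions; wire the hook into wherever your harness already observes each agent invocation.

```python
import time
from collections import defaultdict

metrics = defaultdict(lambda: {"tokens": 0, "calls": 0, "revisions": 0,
                               "started": None, "finished": None})

def record_call(task_id, agent, prompt_tokens, completion_tokens,
                is_revision=False):
    m = metrics[task_id]
    m["tokens"] += prompt_tokens + completion_tokens  # tokens per task, all agents
    m["calls"] += 1
    m["revisions"] += int(is_revision)                # revision-loop count
    now = time.monotonic()
    m["started"] = m["started"] or now
    m["finished"] = now                               # wall-clock = finished - started

def duplication_factor(prompt_a: str, prompt_b: str) -> float:
    # Crude proxy for per-handoff context duplication: the share of
    # lines two agents' prompts have in common. A real version would
    # diff token spans rather than lines.
    a, b = set(prompt_a.splitlines()), set(prompt_b.splitlines())
    return len(a & b) / max(len(a | b), 1)
```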

If your tokens-per-task scales roughly linearly with the number of agents you involve in a task, the shared-scope problem is real in your system. The benchmark is not complicated to run. Most teams just have not run it because the cost looks like a legitimate scaling curve rather than an overhead curve. They look the same from the billing console.

We have run this on production-prep workloads. The reduction in total token spend when the context duplication problem is addressed is substantial enough that it changes the model economics entirely.

Putting It Together

The decomposition reflex is not wrong. Parallelism and fault isolation are real benefits, and they are worth paying for. But paying for them without measuring what you are actually buying is how teams end up with agent swarms that cost more and deliver less than a simpler design would have.

The problem is solvable. The architecture that solves it is not complicated in concept, though the implementation discipline is real. Shared scope, pre-computed work products, tiered dispatch. The hard part is instrumentation: knowing what you are actually spending per unit of work and understanding where the overhead is coming from.

If your team is hitting this wall and you would rather work with someone who has measured it on real workloads, that is what we do at McIntosh Consulting. We help AI teams understand where their system cost is actually going and what to do about it. A conversation is a reasonable first step.

Schedule a consultation.