Three-Way Dispatch: Write, Review, Compose

A four-PR session I shipped last month came in at roughly 5,000 Claude tokens per PR. The codex agent doing the writing burned 42,000 to 60,000 tokens per PR. An ollama review dispatch running in parallel cost another 30,000. Wall clock per PR was five to fifteen minutes, including the cross-model gate. The context I spent staying in the loop was small enough that the session never threatened the cap.

That distribution is not an optimization trick. It is what the architecture forces. The three jobs of the session were write code, review code, and compose the next prompt. They are different jobs. They run on different models.

The Cost-Tier Trap

Most LLM routing discussions frame the problem as cost. Which model is cheapest per token. Which model is fastest. Which model gives the best output per dollar.

That framing produces routing rules like “use the cheap model when you can, the expensive one when you have to.” It is a tiering discussion. It treats every dispatch as a generic call against a generic task, and asks which size of model the task needs.

This misses the actual structure of agentic work. The three things an agent session does at scale are: produce a diff, evaluate a diff, and decide what to do next. Those are not the same task at three sizes. They are three different tasks. Each one favors a different model not because of cost but because of fit.

Three Jobs, Three Models

Writing code rewards precision, instruction-following, and willingness to spend reasoning tokens up front. Codex (gpt-5-family via the codex exec interface) is the strongest fit among current models. It will burn 13 to 17 thousand reasoning tokens regardless of output size, then produce a diff that hits the spec the first time on roughly four out of six dispatches. The other two tend to be reasonable wrong calls based on plausible assumptions, not obvious bugs.

Reviewing code rewards a different skill: catching what the writer missed. The reviewer should not be the same model as the writer, because the reviewer’s blind spots will be the writer’s blind spots. Ollama-hosted GLM-5.1 or GLM-4.6 is good at this. The output is sometimes messy and needs cleanup, but the catches are real, and the parallel dispatch cost is low enough that you can run two or three reviewer instances at once with different prompts.

Composing the next prompt rewards context and continuity. This is Claude’s actual job in the session: read what the writer produced, read what the reviewer flagged, decide what the next dispatch should do, write the brief. The thing Claude has that the other two do not is the running thread of what has happened so far in the session. That thread is what gets spent on prompt composition: roughly 5,000 tokens per PR for someone who has been doing it for a while.
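The role split can be written down as a small routing table. This is a minimal sketch, not the setup used in the session: the `Role`, `Backend`, `ROUTES`, and `check_routes` names are hypothetical, and the backend labels are plain strings, not real API identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    WRITE = "write"
    REVIEW = "review"
    COMPOSE = "compose"


@dataclass(frozen=True)
class Backend:
    name: str    # label only; not a real API identifier
    family: str  # model family, used to keep writer and reviewer apart


# Hypothetical routing table mirroring the three roles described above.
ROUTES = {
    Role.WRITE: Backend("codex", "gpt"),
    Role.REVIEW: Backend("ollama-glm", "glm"),
    Role.COMPOSE: Backend("claude", "claude"),
}


def check_routes(routes: dict[Role, Backend]) -> None:
    """Reject a table with a missing role or a same-family writer and reviewer."""
    missing = [role for role in Role if role not in routes]
    if missing:
        raise ValueError(f"no backend for roles: {missing}")
    if routes[Role.WRITE].family == routes[Role.REVIEW].family:
        raise ValueError("writer and reviewer share a model family")


check_routes(ROUTES)  # passes silently for the table above
```

The family check is the only rule encoded here, because it is the one constraint the role split depends on: a reviewer from the writer's family is likely to share its blind spots.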

The Handoff

The flow that ran across four PRs last session looked roughly like this. Claude composes a brief for codex. Codex writes the diff and commits to a worktree branch. Two ollama instances run in parallel against the diff with different review framings (one for correctness, one for code quality). Claude reads the reviewer outputs, decides whether to merge or to dispatch a follow-up, and composes either a commit message or a fix brief.
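The flow above can be sketched as a single function. This is an illustration under stated assumptions: the `Writer`, `Reviewer`, and `Composer` signatures and the stub backends are hypothetical stand-ins, and a real version would shell out to codex, ollama, and Claude instead of calling local functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical signatures; real dispatches would call out to each backend.
Writer = Callable[[str], str]          # brief -> diff
Reviewer = Callable[[str, str], str]   # (framing, diff) -> verdict
Composer = Callable[[list[str]], str]  # verdicts -> commit message or fix brief

FRAMINGS = ("correctness", "code quality")


def run_pr(brief: str, write: Writer, review: Reviewer, compose: Composer) -> str:
    """One pass of the write, review, compose loop for a single PR."""
    diff = write(brief)
    # One reviewer per framing, run in parallel against the same diff.
    with ThreadPoolExecutor(max_workers=len(FRAMINGS)) as pool:
        verdicts = list(pool.map(lambda framing: review(framing, diff), FRAMINGS))
    # The composer receives only the verdicts, never the diff itself.
    return compose(verdicts)


# Stub backends so the sketch runs without any model installed.
def stub_write(brief: str) -> str:
    return f"diff for: {brief}"


def stub_review(framing: str, diff: str) -> str:
    return f"{framing}: ok"


def stub_compose(verdicts: list[str]) -> str:
    if all(verdict.endswith("ok") for verdict in verdicts):
        return "merge"
    return "fix brief: " + "; ".join(verdicts)


print(run_pr("add retry to fetch()", stub_write, stub_review, stub_compose))
```

Note that `compose` takes only the list of verdicts. Keeping the diff out of its signature is what keeps the diff out of the orchestrator's context.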

The interesting part is not the steps. It is the fact that Claude never spends context reading the diff line-by-line, never spends context evaluating whether the test names are right, never spends context generating boilerplate. All of that happens on backends whose token budgets are separate from Claude’s.

The result, measured across a four-PR session: total wall clock around 50 minutes, total Claude context burn under 25,000 tokens, four PRs landed on master.
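As a rough check on those totals, the per-PR figures quoted earlier can be multiplied out. One assumption is labeled in the code: the 30,000-token ollama figure is treated as a per-PR cost, which the text does not state outright.

```python
PRS = 4
CLAUDE_PER_PR = 5_000            # approximate, from the session figures
CODEX_PER_PR = (42_000, 60_000)  # low and high ends of the quoted range
OLLAMA_PER_PR = 30_000           # ASSUMPTION: treated as a per-PR cost

claude_total = PRS * CLAUDE_PER_PR
codex_total = tuple(PRS * tokens for tokens in CODEX_PER_PR)
ollama_total = PRS * OLLAMA_PER_PR

session_low = claude_total + codex_total[0] + ollama_total
session_high = claude_total + codex_total[1] + ollama_total

# Claude's share of all tokens spent, at each end of the codex range.
share_min = claude_total / session_high
share_max = claude_total / session_low

print(f"Claude total: {claude_total:,} tokens")
print(f"Session total: {session_low:,} to {session_high:,} tokens")
print(f"Claude share: {share_min:.1%} to {share_max:.1%}")
```

Under that assumption Claude spends about 20,000 tokens, consistent with the "under 25,000" figure, out of a session total between 308,000 and 380,000. That is roughly 5 to 6.5 percent of the tokens spent.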

Why This Beats Single-Agent Tiering

A single-agent tiered approach does the same work with one agent that chooses different models for different sub-steps. The model selection can be identical, but every output lands back in that one agent’s context, so it is in effect doing the writing AND the reviewing AND the composition itself. Two costs follow.

First, context pollution. Every diff the orchestrator reads, every reviewer output it scans, every codex completion it inspects sits in its context window. Eventually that context fills with read-only artifacts and the orchestrator runs out of room to think. The three-way split keeps the orchestrator’s context clean because it never reads the diff directly; it reads the reviewer’s verdict.

Second, the orchestrator becomes the bottleneck. Single-agent tiering serializes. The orchestrator writes, then reviews, then composes. Three-way dispatch parallelizes. While codex writes, the reviewer’s prompt is already being prepared. While the reviewer runs, Claude is composing the next brief. The wall-clock gain is real.

When This Architecture Fails

Three-way dispatch is overkill for tasks that fit in a single orchestrator dispatch. If the work is a mechanical edit, a single-file refactor, or a doc tweak, splitting it across three backends adds setup overhead without saving anything.

It also fails when the reviewer model and the writer model are too similar in lineage. If the reviewer agrees with the writer on everything because they share the same biases, the review pass is theater. That is why a different model family for the reviewer matters more than the reviewer being the strongest model in the rack.

The architecture earns its setup cost when the session will span more than three or four substantive dispatches, when the work has a write-review-revise loop in it, and when the orchestrator’s context is the constraint you care about.
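Those three conditions can be expressed as a simple predicate. This is a sketch only: `SessionPlan` and `worth_three_way` are hypothetical names, and the default threshold of four dispatches is one reading of "more than three or four", not a measured cutoff.

```python
from dataclasses import dataclass


@dataclass
class SessionPlan:
    substantive_dispatches: int  # expected count, excluding mechanical edits
    has_review_loop: bool        # the work includes write-review-revise
    context_is_constraint: bool  # orchestrator context is the scarce resource


def worth_three_way(plan: SessionPlan, min_dispatches: int = 4) -> bool:
    """Return True only when all three conditions from the text hold."""
    return (
        plan.substantive_dispatches >= min_dispatches  # ASSUMPTION: threshold
        and plan.has_review_loop
        and plan.context_is_constraint
    )


doc_tweak = SessionPlan(1, has_review_loop=False, context_is_constraint=False)
long_session = SessionPlan(6, has_review_loop=True, context_is_constraint=True)

print(worth_three_way(doc_tweak))     # False
print(worth_three_way(long_session))  # True
```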

The Framing

Routing by cost tier asks “how cheap can this dispatch be.” Routing by cognitive role asks “which model is built for this job.” Cost tier optimizes for the bill. Cognitive role optimizes for the session ending with a clean context and a clean diff.

In a long session, the bill is not the thing that runs out first. The orchestrator’s context is.