If you only use one large language model, you optimize around its idiosyncrasies until they stop looking like idiosyncrasies. Switch to a second one and the differences resurface — but now you have a baseline. Run the same well-specified task through four or five models in parallel and you stop having opinions about models. You have evidence.
Over the last few sessions I have been doing exactly this: writing a tight, structured prompt — usually a single deliverable like a bash script or a markdown skill file with every constraint enumerated — and dispatching it to GPT-5 (via the codex CLI), GLM-5.1, GLM-4.6, and DeepSeek-V3.2 in parallel. Then comparing the outputs side by side, fixing whatever each one got wrong, and shipping. The fixing patterns are not random. Each model has a signature failure mode that shows up across tasks, and once you know what to look for you can route work to the model most likely to nail it on the first pass.
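Mechanically, the dispatch is just a fan-out and a wait. Here is a hypothetical sketch: run_model is an illustrative stand-in for however each provider actually gets invoked (pi for the GLM and DeepSeek models, codex exec for GPT-5), not a real command.

```bash
# Hypothetical fan-out: send the same prompt to every model in parallel,
# one output file per model, then review the results side by side.
# `run_model` is a placeholder for your actual per-provider invocation.
for model in gpt-5 glm-5.1 glm-4.6 deepseek-v3.2; do
  run_model "$model" prompt.md > "out.${model}.md" 2> "err.${model}.log" &
done
wait
```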
This is a working set of rules, not a benchmark study. The sample size is dozens of dispatches, not hundreds. But the patterns have been consistent enough that I now route by them rather than by intuition.
The single dominant failure mode for the GLM family: reasoning leakage
GLM-5.1 has a visible “thinking” mode. It will produce a draft, notice its own bugs, write meta-commentary about what it should fix, and then emit a corrected version. All of this lands in stdout. On a recent dispatch — a 220-line bash script for pruning merged git branches — the model emitted both the buggy first draft and the corrected version, with about 40 lines of “wait, I need to double-check…” between them. Extracting the corrected script meant identifying the second #!/usr/bin/env bash and slicing from there.
This is not a defect of the prompt. The prompt explicitly said “output only the script contents.” The thinking-mode output is just what GLM-5.1 produces when it is being thoughtful. The fix is post-processing, not prompt engineering — strip the reasoning block before treating the output as canonical.
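A minimal post-processing sketch, assuming the corrected script is the last thing in the output that starts with a shebang (true for the branch-pruning dispatch above, but worth eyeballing per output; the file names are illustrative):

```bash
# Keep everything from the last shebang onward; everything before it is
# treated as reasoning or an abandoned draft. Assumes the corrected script
# is the final shebang-led block in the model's stdout.
awk '/^#!\/usr\/bin\/env bash$/ { start = NR }
     { lines[NR] = $0 }
     END { if (start) for (i = start; i <= NR; i++) print lines[i] }' \
  glm51.out > prune-branches.sh
```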
GLM-4.6 has a different problem: it does not reason visibly, but it also does not fully transform prompt scaffolding into final content. If your prompt says “include a short intro (3-4 lines)” and “a bulleted list of 4-5 triggers,” GLM-4.6 will sometimes leave the literal strings “A short intro:” and “A bulleted list of 4-5 triggers:” in the body, as if it were halfway through a draft. The structure is there, the content is mostly there, but the prompt’s instructions about the form leak through into the form itself.
The corrective for both is the same: never ship a GLM output without reading it end-to-end. They produce good content, but they also produce content that looks done before it is.
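Reading end-to-end is the real corrective, but a grep for leftover scaffolding catches the most mechanical form of the 4.6 failure before you even open the file. The patterns below are illustrative, tied to the prompt wording quoted above, and need adjusting per prompt.

```bash
# Flag literal prompt scaffolding that leaked into the output. Patterns are
# examples; extend them to match whatever structure your prompt enumerates.
grep -nE '^(A short intro:|A bulleted list of 4-5 triggers:)' skill.md \
  && echo "prompt scaffolding leaked into skill.md" >&2
```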
Codex is precise and expensive
GPT-5 via the codex CLI hits the spec almost exactly. On the same six tasks where GLM needed post-processing, codex produced shippable output four times out of six, and the two misses were only wrong relative to project-specific conventions: it imported ClaudeAgent from the wrong submodule path because it assumed a re-export, and it used pytest markers (network, slow) that turned out not to be the convention in that particular project. Both were assumptions that would have been the right call in most codebases.
The cost is token consumption. Each codex dispatch in this batch ran 13-17K reasoning tokens, regardless of output size. The shortest output (a 90-line Python hook) cost 14.7K tokens. The longest (a 220-line bash driver) cost 13.9K. Codex pays its reasoning budget up front and produces clean output; the GLM family pays a much smaller token cost and produces output that needs review. Neither is strictly better. The right tradeoff depends on how much you trust your review process versus how much you trust the model.
DeepSeek-V3.2 sits between the two. The sample size for DeepSeek in this batch is one dispatch — a small ingest helper script — and it produced a working script with a single idiomatic bug: it used ${PYTHON_VENV:-python3} to pick a Python interpreter, which falls back only when the variable is empty or unset, never when the path simply does not exist on disk. That is the kind of bug a Python programmer writing bash makes; a bash programmer would have used [[ -x "$path" ]]. The output was 95% right, and the missed 5% was conceptual rather than structural.
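For contrast, a sketch of the two idioms (variable and file names are illustrative, not lifted from the original helper): the parameter-expansion fallback only fires when the variable is unset or empty, while the [[ -x ]] check also fires when the path no longer exists or is not executable.

```bash
# The leaked idiom: falls back to python3 only when PYTHON_VENV is unset or
# empty, never when the path it names is missing from disk.
python_bin="${PYTHON_VENV:-python3}"

# The existence-aware version: trust the variable only if it points at an
# executable file, otherwise fall back.
if [[ -n "${PYTHON_VENV:-}" && -x "${PYTHON_VENV}" ]]; then
  python_bin="${PYTHON_VENV}"
else
  python_bin="python3"
fi
```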
The dispatch loop has its own failure modes
The wrapper layer matters as much as the model. Two failure modes show up regardless of which model you are talking to.
The first: both pi (the multi-provider client) and codex exec block on stdin EOF in non-TTY contexts. If you invoke them from a script without explicitly closing stdin, they will hang forever waiting for an EOF that never arrives. On a development box where you cannot attach a terminal, this looks like the LLM stalled. It is not the LLM. It is the wrapper waiting for input it will never receive. The fix is one redirect: < /dev/null. The cost of not knowing this is real — a previous session lost approximately fifty minutes of wall clock to two separate hangs of this exact shape before the lesson got captured. The fix is so small that it deserves a hook: anytime you write a Bash command that invokes either tool, append the redirect automatically.
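A minimal sketch of a dispatch that cannot hang this way, assuming codex exec takes the prompt as a positional argument (check your installed version; the file names are illustrative). The load-bearing part is the final redirect.

```bash
# Close stdin explicitly so the CLI sees EOF immediately instead of waiting
# forever in a non-TTY context. The same redirect applies to pi.
codex exec "$(cat prompt.md)" > codex.out 2> codex.err < /dev/null
```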
The second: backgrounded subprocesses do not die when their wrapper does. If you spawn a long-running model dispatch via nohup ... & disown and then stop the wrapper, the actual model process re-parents to PID 1 and continues running. On API-paid models this consumes quota for as long as the model wants to keep going, which can be tens of minutes after you thought you had stopped it. The notification you get from the orchestrator says “completed” but it is reporting on the wrapper, not the model. Verify with ps.
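A quick way to audit for these, assuming the dispatched process shows up under a recognizable name ("codex" below is an illustrative pattern): anything whose parent PID is 1 has outlived its wrapper and is still running.

```bash
# List processes that have re-parented to PID 1 and still mention codex.
# NR == 1 keeps the header row; adjust the pattern for other CLIs.
ps -eo pid,ppid,etime,args | awk 'NR == 1 || ($2 == 1 && /codex/)'
```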
When to route what
A working set of rules, given the patterns above:
Codex for any production Python or bash script where correctness is load-bearing and the spec is detailed. Specifically: subprocess shims, test files with structural assertions, anything that pins an API surface, anything where a silent bug costs more than the extra tokens.
GLM-5.1 when you want a self-correcting pass and you are willing to post-process. Useful for medium-complexity bash where you want the model to catch its own off-by-one errors before you do.
GLM-4.6 for high-volume structured prose: skill documentation, markdown templates, anything where the shape is fixed and the content is the variable. Faster than GLM-5.1, more prone to placeholder leakage, but the output is recoverable with a single cleanup pass.
DeepSeek for bash-plus-Python glue, given a strong reviewer downstream. One data point is not enough to recommend it for production-grade work yet.
The cross-cutting rule: never run a parallel dispatch without an output review step. The whole point of dispatching to multiple models is that each one will get something wrong. The cost of the dispatch is the parallel API calls; the cost of not reviewing the outputs is the bugs you ship because one of the models was confident and wrong.
A note on token economics
A naïve reading of this is “codex costs 10x more than GLM, so use GLM.” That is wrong in both directions. Codex tokens are billed against a subscription, not per-call, and the reasoning budget is amortized across what would otherwise be a review-and-rewrite loop on a smaller model. GLM tokens are cheaper but the review-and-rewrite happens in your time, which is the most expensive resource in the system. The actual economic calculation is: what does it cost you, in wall clock and attention, to clean up the output? For complex deliverables that question matters more than the API bill.
This is the same trap as the agent-swarm decomposition pattern I wrote about last week. Cheaper-per-call is not cheaper-per-task. Decomposing a task across many small cheap models can cost more than running it once through a model that gets it right on the first pass, because the cleanup cost never shows up on the API bill.
The takeaway
Use the model that nails your spec on the first pass, not the model that is cheapest per token. For most well-specified tasks today that means codex. For prose where the shape is more forgiving, GLM-4.6 is faster and the post-processing cost is acceptable. For everything else, run a parallel dispatch and pick the best output — but only if you trust your review process to catch what each model gets wrong. The interesting question is not “which model is best.” It is “which failure mode am I prepared to handle.”