Why I Run Three LLM Backends and Mostly Use the Cheapest One

Most teams pick one LLM backend and stick with it. That is backwards.

I run three backends daily: Claude Opus at the top, GLM-5.1 (or sometimes GLM-4.6 through ollama-cloud) in the middle, and a local 7-billion-parameter qwen2.5-coder running on the RTX 3050 in my dev box. The interesting number is not which model is “best.” It is what fraction of your actual work needs the top model. In my practice that fraction is usually 5 to 15 percent.

Three Tiers in Practice

Tier 1, peak reasoning. Claude Opus, or codex on the GPT-5 family, handles complex refactors, integration glue, and architectural decisions. This is where the model’s reasoning power matters. These calls account for roughly 10 percent of my total LLM usage.

Tier 2, bulk coding work. GLM-5.1, or ollama-cloud’s GLM-4.6, handles standard feature development, test writing, and documentation generation. The quality is sufficient for most tasks, and rework is cheap when it fails. About 50 percent of calls.

Tier 3, mechanical edits, local. qwen2.5-coder:7b on the local GPU handles single-file mechanical edits, formatting, scratch utilities, and simple transformations. The model runs with no network round trip, costs nothing per call, and handles roughly 40 percent of total usage.
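The three tiers above can be written down as a small lookup table. This is only a sketch of how I might encode them: the class, the tier labels, and the task-kind strings are hypothetical names made up for illustration, while the backends and the rough call shares come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str                     # hypothetical label for the tier
    backends: tuple[str, ...]     # models named in the text
    task_kinds: tuple[str, ...]   # hypothetical task labels
    share_of_calls: float         # rough share of calls quoted in the text


TIERS = (
    Tier("peak_reasoning", ("claude-opus", "codex"),
         ("complex_refactor", "integration_glue", "architecture"), 0.10),
    Tier("bulk_coding", ("glm-5.1", "glm-4.6"),
         ("feature", "tests", "docs"), 0.50),
    Tier("local_mechanical", ("qwen2.5-coder:7b",),
         ("mechanical_edit", "formatting", "scratch_utility",
          "simple_transform"), 0.40),
)


def tier_for(task_kind: str) -> Tier:
    """Return the tier whose task list contains task_kind."""
    for tier in TIERS:
        if task_kind in tier.task_kinds:
            return tier
    raise KeyError(f"no tier registered for task kind {task_kind!r}")


# The three rough shares should account for all calls.
total_share = sum(t.share_of_calls for t in TIERS)
```

Calling tier_for("formatting") returns the local tier, and total_share confirms that the three approximate shares add up to the whole.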

The distribution is not academic. I have tracked it over weeks of real development work and the pattern holds: you rarely need the expensive model for more than a small fraction of your actual tasks.

Real Numbers from a Live Session

Last week I shipped roughly 150 lines of production code across four pull requests. The work involved reading existing codebases, writing new components, adding tests, and updating documentation. Total context-window usage across all three backends: roughly 150,000 tokens.

How much of that went through the top-tier model? About 30,000 tokens, or 20 percent of the session’s total token usage. The rest was routed elsewhere: read-heavy exploration and test generation went to a codex subagent, formatting and mechanical edits went to the local 7b, and bulk coding tasks went to GLM-5.1.

The top-tier model never saw most of the files. It did not need to read the 15 test files that were generated. It did not process the 8 formatting passes. The work that stayed off the top tier accounted for roughly 80 percent of the session’s tokens and ran on cheaper backends.
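The arithmetic behind that split is simple enough to check in a few lines. In this sketch the function and dictionary keys are hypothetical names; the 120,000 figure is just the 150,000 total minus the 30,000 top-tier tokens, since I have not broken the remainder down per backend here.

```python
def token_shares(usage: dict[str, int]) -> dict[str, float]:
    """Convert per-backend token counts into fractions of the total."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage must contain at least one token")
    return {backend: tokens / total for backend, tokens in usage.items()}


# Session figures from the text: roughly 150,000 tokens in total,
# about 30,000 of them on the top-tier model.
session_usage = {"top_tier": 30_000, "all_other_backends": 120_000}
shares = token_shares(session_usage)
```

With these inputs the top tier comes out at one fifth of the tokens and everything else at four fifths.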

Context Pollution Is the Real Cost

The hidden expense of using a top-tier model for everything is not the per-call price. It is context-window pollution. Every diff that model reads, every test output it scans, every file it opens: that token spend competes with the deep reasoning you need it for. Use Opus for formatting tasks and you are burning the budget you need for architectural decisions. Context windows are finite. The high-end model’s most precious resource is not API credits; it is context capacity.

The Fix: Dispatcher Mode

The solution is not to pick a cheaper model. It is to route work by fit.

Claude becomes the strategist. It looks at the task and decides where it belongs: “deep reasoning, I will handle it,” “standard CRUD work, dispatch to GLM-5.1,” “formatting, dispatch to the local 7b.”

The dispatcher does not replace the high-end model. It preserves the high-end model’s capacity for the work that benefits from its capabilities. Claude’s role shifts from doing everything to directing everything.

This requires structure: your tooling needs to understand which backends are available, what each does well, and how to route tasks appropriately. The investment pays for itself in saved context capacity alone.
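To make the routing structure concrete, here is a minimal sketch that swaps Claude’s judgment for a static lookup table. That is a simplification, not how the dispatcher described above decides. The enum, the task-kind strings, the escalation order, and the rule that unknown tasks default to the top tier are all my assumptions for illustration.

```python
from enum import Enum


class Backend(Enum):
    LOCAL = "qwen2.5-coder:7b"
    BULK = "glm-5.1"
    TOP = "claude-opus"


# Cheapest first. If a preferred backend is unavailable, escalate upward.
ESCALATION = [Backend.LOCAL, Backend.BULK, Backend.TOP]

# Hypothetical task labels mapped to the tier that fits them.
ROUTES = {
    "formatting": Backend.LOCAL,
    "mechanical_edit": Backend.LOCAL,
    "scratch_utility": Backend.LOCAL,
    "feature": Backend.BULK,
    "tests": Backend.BULK,
    "docs": Backend.BULK,
    "complex_refactor": Backend.TOP,
    "integration_glue": Backend.TOP,
    "architecture": Backend.TOP,
}


def dispatch(task_kind: str, available: set[Backend]) -> Backend:
    """Pick the cheapest available backend at or above the preferred tier."""
    # Assumption: anything unrecognized goes to the top tier to be safe.
    preferred = ROUTES.get(task_kind, Backend.TOP)
    start = ESCALATION.index(preferred)
    for backend in ESCALATION[start:]:
        if backend in available:
            return backend
    raise RuntimeError(f"no available backend can take {task_kind!r}")
```

Escalating upward rather than downward is a deliberate choice in this sketch: if the local model is down, a formatting task costs a little more on GLM-5.1, but an architectural task is never silently handed to a weaker model.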

The Right Framing Question

The “best model” framing is backwards. The right question is which model fits this task at the lowest cost while still delivering acceptable quality.

The team that runs three backends and mostly uses the cheapest one ships faster than the team that runs one backend at premium. That is not because cheap models are magically better, but because matching model capability to task requirements preserves the high-end model’s capacity for the work that needs it. Teams that commit entirely to a top-tier model get throttled by context limits on complex projects, while teams using a tiered approach keep the high-end model available for the parts that matter.

Your Bill Is Your Failure Rate

Look at your API bill. If the most expensive model line dominates your costs, you are routing too much work to it. If that line sits under 15 percent of your total spend, you are probably dispatching correctly.
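That check is easy to script against whatever your billing export looks like. The sketch below assumes the bill has already been reduced to a mapping of model name to spend. The function names and the example figures are hypothetical, and the 50 percent cutoff for “dominates” is my own assumption; only the 15 percent line comes from the text.

```python
def top_tier_share(bill: dict[str, float], top_tier: str) -> float:
    """Fraction of total spend that went to the top-tier model."""
    total = sum(bill.values())
    if total <= 0:
        raise ValueError("bill total must be positive")
    return bill.get(top_tier, 0.0) / total


def verdict(share: float, ok_below: float = 0.15,
            dominates_above: float = 0.50) -> str:
    """Classify a top-tier spend share.

    ok_below is the 15 percent line from the text; dominates_above is
    an assumed cutoff for when one line "dominates" the bill.
    """
    if share < ok_below:
        return "probably dispatching correctly"
    if share > dominates_above:
        return "top tier dominates: review routing"
    return "in between: inspect which tasks reached the top tier"


# Hypothetical monthly figures, for illustration only.
example_bill = {"claude-opus": 10.0, "glm-5.1": 90.0, "qwen2.5-coder:7b": 0.0}
example_share = top_tier_share(example_bill, "claude-opus")
example_verdict = verdict(example_share)
```

The middle band is left deliberately vague: between the two cutoffs the number alone does not say much, and the useful next step is to look at which tasks ended up on the expensive line.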

The bill on the most expensive model line is your dispatch failure rate made visible.