Cloud LLMs Have Failure Modes Benchmarks Don't See

A 480-billion-parameter coding model collapsed into a degenerate loop on two of four long-form dispatches in a single session, returning roughly 4,000 tokens each time. The output was syntactically clean. The repetition was the word “incumbency” followed by variations on “procuring obtaining acquiring,” repeated for several thousand tokens before the dispatch terminated. The same prompts ran clean on all four dispatches on GLM-4.6, a smaller model from a different vendor.

None of this would have shown up in a public benchmark score. Benchmarks average across many short prompts. The failure mode here only fires on long structured outputs. The benchmark calls the larger model the better model, and that call is accurate for the average case. It is wrong for the use case that breaks it.

What the Failure Looks Like

The degenerate-loop pattern looks like this in practice. The model starts the output on-topic. It produces a paragraph or two of coherent content. Then a word or phrase from earlier in the output gets re-anchored, and from that point forward the model is producing variations on the same fragment for hundreds or thousands of tokens. The output structure (heading markers, bullet markers, section breaks) sometimes continues to look correct, which is what makes the failure invisible to a quick visual scan.

The dispatches in question were briefs that asked for long structured prose: positioning analysis, audience watering-holes, competitor teardowns. Each called for 1,500 to 3,500 words of output. Two of the four came back as roughly 4,000-token outputs in which the first 600 tokens were on-topic and the remaining 3,400 were the looped variation.

The same prompts on GLM-4.6 produced clean, on-topic, structurally correct output on all four runs. The smaller, supposedly less capable model went four for four on the exact tasks where the larger model failed catastrophically.

Why Benchmarks Don’t See This

Public LLM benchmarks score average performance across a corpus of short, well-defined prompts. They report a number. The number is genuinely useful for comparing models on the average case.

Tail-risk failure modes do not show up in averaged scores. If a model fails catastrophically on one in four dispatches of a specific shape, but the benchmark corpus contains zero dispatches of that shape, the failure rate is zero according to the benchmark and 25 percent according to the actual use.
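The arithmetic in the paragraph above can be made concrete with a small sketch. The data here is invented purely for illustration: `benchmark` and `production` are hypothetical result lists of (shape, passed) pairs, and the shape labels are placeholder names, not categories from any real benchmark suite.

```python
def failure_rate(results, shape=None):
    """Fraction of failed runs, optionally restricted to one prompt shape.

    `results` is a list of (shape, passed) pairs. Returns 0.0 when no run
    matches, which mirrors how a corpus with no prompts of a given shape
    reports no failures for it.
    """
    selected = [passed for s, passed in results if shape is None or s == shape]
    if not selected:
        return 0.0
    return sum(1 for passed in selected if not passed) / len(selected)


# Hypothetical data: a benchmark made only of short prompts, all passing,
# and a production workload where one long structured dispatch in four fails.
benchmark = [("short", True)] * 200
production = [("long_structured", True)] * 3 + [("long_structured", False)]

benchmark_rate = failure_rate(benchmark, shape="long_structured")
production_rate = failure_rate(production, shape="long_structured")
print(benchmark_rate, production_rate)
```

The benchmark figure is zero only because nothing of that shape was ever run, not because the model was shown to be reliable on it.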

The shape that triggers this specific failure mode is long structured outputs with multiple sections and consistent formatting requirements. That shape is common in real consulting work, technical writing, and code generation across large codebases. It is rare in benchmark suites because benchmark suites are designed for repeatable scoring, which biases toward short single-section prompts.

The result is that benchmark numbers and production reliability numbers can disagree sharply for some models on some tasks. The model that wins the benchmark can be the model that fails the production workload.

How to Detect It Cheaply

The detection pattern is simple. Run the prompt on two backends in parallel. Compare the outputs. If they diverge in length by more than a factor of two, or if a grep -c '^###' on the two outputs returns substantially different section counts, look at the longer one for repetition.
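A minimal sketch of the comparison described above, under a few assumptions: length is measured in whitespace-separated words as a stand-in for tokens, section headings start with `###`, and a section-count gap of more than two counts as “substantially different.” The function names and both thresholds are illustrative choices, not values taken from the sessions described here.

```python
def count_sections(text, marker="###"):
    """Count lines starting with the heading marker (like grep -c '^###')."""
    return sum(1 for line in text.splitlines() if line.startswith(marker))


def diverges(output_a, output_b, max_length_ratio=2.0, max_section_gap=2):
    """Return True if two outputs differ enough to warrant a closer look.

    Length is measured in whitespace-separated words, an assumed proxy
    for tokens. Both thresholds are illustrative defaults.
    """
    len_a, len_b = len(output_a.split()), len(output_b.split())
    shorter, longer = sorted((len_a, len_b))
    if shorter == 0:
        # One output is empty: flag it unless both are empty.
        return longer > 0
    length_flag = longer / shorter > max_length_ratio
    section_gap = abs(count_sections(output_a) - count_sections(output_b))
    return length_flag or section_gap > max_section_gap
```

When `diverges` returns True, the longer of the two outputs is the one to inspect for repetition first.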

For long structured outputs specifically, a tail-pattern check works: take the last 500 tokens of the output and run a simple frequency analysis. If a small handful of words (ignoring common function words such as “the” and “of”) make up more than 20 percent of the tail, the output is in degenerate mode and should be discarded.
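The tail-pattern check above could look something like the following. This is a sketch under stated assumptions: words stand in for tokens, “a small handful” is taken to mean the top three words, and a short hypothetical stopword list keeps ordinary function words from triggering the check. Only the 500-unit tail and the 20 percent threshold come from the text.

```python
import re
from collections import Counter

# Hypothetical stopword list (an assumption, not from the original check).
STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "are", "be", "this", "by", "at",
})


def tail_is_degenerate(text, tail_words=500, top_k=3, threshold=0.20):
    """Return True if a few words dominate the end of the output.

    Looks at the last `tail_words` words (an assumed proxy for tokens),
    counts the non-stopwords, and checks whether the `top_k` most common
    ones exceed `threshold` as a share of the whole tail.
    """
    words = re.findall(r"[a-z']+", text.lower())
    tail = words[-tail_words:]
    if not tail:
        return False
    counts = Counter(w for w in tail if w not in STOPWORDS)
    top_total = sum(n for _, n in counts.most_common(top_k))
    return top_total / len(tail) > threshold
```

Tightly focused prose can repeat its key terms heavily, so the thresholds are worth calibrating against known-good outputs from your own workload before trusting the check to discard anything automatically.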

Neither check is expensive. Both cost less than reading the full output to discover the failure manually. The cost of NOT checking is shipping a 4,000-token paragraph of “procuring obtaining acquiring” into a document that was supposed to be a competitor teardown.

The Operational Rule That Falls Out

Once a model’s tail-risk failure mode is characterized, the routing rule is straightforward. Use the reliable model as primary for the workload that triggers the failure. Use the larger model as a parallel second opinion when it is worth dispatching, on the assumption that some non-trivial fraction of those dispatches will need to be discarded.

In the specific case above, GLM-4.6 became the primary for long structured outputs in subsequent sessions. The 480-billion-parameter qwen3-coder remained available for parallel second-opinion dispatches, but its output now gets a tail-pattern check before being treated as usable. Its failure rate on long structured outputs is high enough that it is no longer the default for that shape of task.
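The routing rule can be sketched as a small dispatcher. Everything named here is hypothetical: `primary` and `secondary` are placeholder callables that take a prompt and return text (no real client library is implied), and `looks_degenerate` is whatever check you trust, such as a tail-pattern check.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(prompt, primary, secondary, looks_degenerate):
    """Run both backends in parallel and gate the second opinion.

    `primary` and `secondary` are hypothetical callables (prompt -> text).
    The primary output is always returned; the secondary output is
    replaced with None when `looks_degenerate` flags it.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary_future = pool.submit(primary, prompt)
        secondary_future = pool.submit(secondary, prompt)
        primary_out = primary_future.result()
        secondary_out = secondary_future.result()
    if looks_degenerate(secondary_out):
        secondary_out = None  # discard rather than ship a looped output
    return primary_out, secondary_out
```

Returning None for a discarded second opinion keeps the caller's logic simple: the primary output is always usable, and the secondary is a bonus when it survives the check.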

What This Generalizes To

The specific failure here is qwen3-coder:480b falling into a degenerate loop. The pattern generalizes. Every cloud LLM has tail-risk failure modes that do not appear in its benchmark numbers. The set is different for each model. You will not learn what they are from a leaderboard. You will learn what they are from running real workloads against the model and watching what breaks.

A practical consequence: do not adopt a new cloud LLM as a primary backend until you have run it on the shape of work you actually do, in volume sufficient to surface failure modes that fire at one-in-four or one-in-ten rates. Benchmark scores are necessary information. They are not sufficient.
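To put a rough number on “volume sufficient,” here is a back-of-envelope sketch. It assumes each dispatch fails independently with a fixed probability, which real workloads may not satisfy, and the 95 percent confidence target is an illustrative choice. The chance of seeing at least one failure in n runs is 1 - (1 - p)^n, so the smallest n that reaches a given confidence is:

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Smallest number of runs likely to surface at least one failure.

    Assumes independent runs with a fixed per-run failure probability.
    Solves 1 - (1 - failure_rate) ** n >= confidence for n.
    """
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(runs_needed(0.25))  # one-in-four failure mode: 11 runs
print(runs_needed(0.10))  # one-in-ten failure mode: 29 runs
```

Under those assumptions, a handful of trial runs is not enough: a one-in-ten failure mode needs on the order of thirty dispatches of the relevant shape before you can reasonably expect to have seen it once.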

Parallel dispatch against two models on real workloads is the cheapest way to characterize a new model’s tail risks. The output divergence between the two is your signal.

The Framing

Treat published benchmarks as a hypothesis about the model’s average behavior. Treat your own parallel-dispatch comparisons as the actual measurement of how the model behaves on your work. The two are not the same number. The one that matters for what you ship is the second one.

The model that wins the benchmark and fails the production workload is the most expensive kind of cloud LLM to adopt: the one whose problems do not show up until you have already wired it into your pipeline.