Web-Search Seed, Local Synthesis

Most teams handling regulated workloads run into the same wall the first time they need a current-data answer from a local LLM. The local model has no internet. The cloud model has internet but cannot see the client data. The team’s options look like a forced trade: ship the data to the cloud model, or live without current data.

The trade is not forced. Separate retrieval from synthesis. Run the retrieval on a model with web access against the public-data part of the problem. Save the results to a local file. Then synthesize against the local file using a model that has the proprietary context but no network. The web-enabled model never sees the proprietary data, and the model that does see it never touches the network.

The Pattern

The mechanic is straightforward.

Phase one is retrieval. A web-enabled tool runs a small number of targeted queries against the open web. The output is plain text written to a local file. The queries do not contain anything sensitive: they are public questions about public-data topics (competitor landscape, current versions of open-source projects, recent regulatory guidance, what is at the top of a specific subreddit this month). The model behind the retrieval sees only the public queries. It writes only to the local file.

Phase two is synthesis. A local LLM reads the local file along with whatever proprietary context the synthesis requires (client data, biosequences, internal proposal drafts, internal-only knowledge bases). The synthesis model has no network. The local file is the only “fresh” data it sees. It produces the integrated output entirely on-host.

The two phases never overlap. The retrieval model never sees the proprietary data. The synthesis model never sees the open internet. The local file is the only connection between them, and it is plain text that you can read.
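The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `run_retrieval`, `build_synthesis_prompt`, the `search` callable, and the seed-file layout are all hypothetical names chosen for this example. The `search` argument stands in for whatever web-enabled tool you use.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_retrieval(queries: Iterable[str],
                  search: Callable[[str], str],
                  seed_path: Path) -> Path:
    """Phase one: run public queries and write the results to a plain-text file."""
    sections = []
    for query in queries:
        # Only the public query string is handed to the web-enabled tool.
        sections.append(f"## Query: {query}\n{search(query)}\n")
    seed_path.write_text("\n".join(sections), encoding="utf-8")
    return seed_path


def build_synthesis_prompt(seed_path: Path,
                           proprietary_context: str,
                           task: str) -> str:
    """Phase two input: seed file plus proprietary context, assembled on-host."""
    seed = seed_path.read_text(encoding="utf-8")
    return (
        f"TASK:\n{task}\n\n"
        f"PUBLIC RESEARCH (from seed file):\n{seed}\n\n"
        f"PROPRIETARY CONTEXT:\n{proprietary_context}\n"
    )
```

The retrieval function has no parameter through which proprietary text could arrive, and the prompt builder makes no network call. The seed file is the only thing passed between them.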

A Real Example

A consulting project last session needed a 2026-current competitor landscape across roughly twelve dimensions, integrated with the client’s own positioning notes (which were not for outside eyes). The naive way to do this is to dispatch a cloud LLM with web access against a brief that includes both the competitor question and the positioning notes. That approach exposes the notes.

The split-pipeline approach: four Claude WebSearch dispatches against the open competitor question, each writing its output to a tmp file. Then a local ollama dispatch of GLM-4.6 against the four tmp files plus the proprietary positioning notes. The output was a 2026-current competitor analysis integrated against the client’s actual positioning, generated entirely without sending the positioning notes to any cloud backend.
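The synthesis step might look roughly like the sketch below, assuming the model is served by an Ollama instance on localhost at its default port and called through the `/api/generate` endpoint. The function names and prompt layout are illustrative, and the model tag is passed in by the caller because the exact tag for a given model depends on your installation.

```python
import json
import urllib.request
from pathlib import Path

# Ollama's default local endpoint; adjust if your instance listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, seed_files: list[Path], notes: str) -> urllib.request.Request:
    """Assemble the synthesis request from the seed files and the private notes."""
    seed = "\n\n".join(p.read_text(encoding="utf-8") for p in seed_files)
    prompt = (
        "Integrate the public research with the positioning notes.\n\n"
        f"PUBLIC RESEARCH:\n{seed}\n\n"
        f"POSITIONING NOTES:\n{notes}\n"
    )
    # stream=False asks Ollama for a single JSON object instead of a token stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(model: str, seed_files: list[Path], notes: str) -> str:
    """Send the request to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, seed_files, notes)) as resp:
        return json.loads(resp.read())["response"]
```

Keeping request construction separate from the network call makes it easy to inspect exactly what will be sent, and to which host, before anything leaves the process.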

Cost on the retrieval side: four WebSearch calls, low. Cost on the synthesis side: a single ollama-cloud dispatch on a model that does not retain prompt content. Total wall-clock time: under fifteen minutes. The alternative (dispatching a single cloud agent with web access for the integrated job) would have cost more, and the actual blocker was the privacy cost, not the dollars.

The Compliance Case

The same architecture answers the compliance question for regulated bio, federal, and IP-sensitive workloads. The compliance team’s worry is not “can the model help us.” It is “does the model see data it should not see.” A web-enabled cloud model that gets handed proprietary context for an integrated query sees both, and the compliance team is right to flag it.

A split pipeline with retrieval on the public side and synthesis on the on-host side answers the compliance question by construction. The cloud model only ever saw public queries. The on-host model saw the proprietary context and the file the retrieval phase wrote, and it had no network. There is no path by which proprietary context reached the cloud.

The audit trail is also legible. The cloud-side requests are logged and contain only public information. The on-host side runs against a model whose weights and inputs you control. The integration point is a plain-text file you can inspect, version-control, and redact if needed.
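One way to make that audit trail concrete is to check the outbound query log against a list of sensitive terms and to record a hash of the seed file. The sketch below is an assumption about how such a check could look; `find_leaks`, `audit_record`, and the record layout are hypothetical, and a substring match is a deliberately blunt filter rather than a complete leak detector.

```python
import hashlib
from pathlib import Path


def find_leaks(query: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that appear in a query (case-insensitive substring match)."""
    lowered = query.lower()
    return [term for term in sensitive_terms if term.lower() in lowered]


def audit_record(query_log: list[str],
                 seed_path: Path,
                 sensitive_terms: list[str]) -> dict:
    """Summarize what crossed the boundary: the queries sent and the file received."""
    leaks = {}
    for query in query_log:
        hits = find_leaks(query, sensitive_terms)
        if hits:
            leaks[query] = hits
    # The digest pins the exact seed-file contents the synthesis ran against.
    digest = hashlib.sha256(seed_path.read_bytes()).hexdigest()
    return {"queries": list(query_log), "leaks": leaks, "seed_sha256": digest}
```

Running `find_leaks` before each query is dispatched, rather than only afterwards, turns the audit into a guard: a query with any hit is simply not sent.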

When This Pattern Fails

The pattern fails when retrieval and synthesis cannot be cleanly split. If the synthesis requires the model to issue follow-up queries against the public data based on what the proprietary context says, the two phases are interleaved, and every follow-up query risks carrying proprietary detail out to the retrieval side.

In practice this is rare. Most workloads that look interleaved actually decompose into a public-question phase and a proprietary-integration phase if you spend a few minutes restructuring the brief. The cases that genuinely require interleaved retrieval are usually research workflows (e.g., literature search that depends on partial results to refine the next query), not consulting deliverables.

The pattern also fails when the local synthesis model is too small to do the integration job. A 7B local model will not handle a twelve-dimension competitor integration. A 30B-equivalent model on ollama-cloud will. The synthesis model has to be big enough for the job, which is a separate constraint from network access.

Why This Is Cheaper Than the Alternative

Dispatching a single web-enabled cloud agent against an integrated brief means paying for one long cloud dispatch, which is non-trivial. The split pipeline costs a few short web queries plus one local synthesis dispatch. On token economics it is usually 30 to 50 percent cheaper, though the bigger win is the privacy property rather than the dollars.

The compounding benefit is that the local-file output of the retrieval phase is reusable. If next month you need the same competitor landscape against a different proprietary context, the retrieval already ran. Refresh the file with one or two new queries, run the synthesis against the new proprietary context, done. The fresh-data phase amortizes across uses.
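The reuse step can be sketched as a freshness check plus an append. The names `seed_is_fresh` and `append_refresh`, the age threshold, and the section header format are assumptions made for this illustration.

```python
import time
from pathlib import Path
from typing import Optional


def seed_is_fresh(seed_path: Path,
                  max_age_days: float,
                  now: Optional[float] = None) -> bool:
    """True if the seed file exists and was modified within max_age_days."""
    if not seed_path.exists():
        return False
    current = time.time() if now is None else now
    age_seconds = current - seed_path.stat().st_mtime
    return age_seconds <= max_age_days * 86400


def append_refresh(seed_path: Path, query: str, result: str, stamp: str) -> None:
    """Append one new public query result to the existing seed file."""
    with seed_path.open("a", encoding="utf-8") as handle:
        handle.write(f"\n## Refresh {stamp}: {query}\n{result}\n")
```

If `seed_is_fresh` returns true, the synthesis can run against the new proprietary context with no retrieval at all. Otherwise one or two `append_refresh` calls bring the file up to date without repeating the original queries.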

The Framing

Cloud LLMs with web access are convenient because they collapse two jobs into one dispatch. The convenience comes with a compliance cost and a privacy cost that both bill in proportion to how sensitive the proprietary context is.

Splitting retrieval from synthesis trades the convenience for control. The control is what regulated workloads need. For the right shape of task, the split pipeline is the only architecture that ships.