Roughly 73 percent of new spatial transcriptomics projects reach for a single-cell foundation model. Most hit the same three walls trying to integrate it.
I spent 2025 trying to make scGPT work for a metabolic engineering project. The paper looked promising (Cui et al., Nature Methods, 2024). The model was on GitHub. The theory was sound. Six months later we had spent roughly $120,000 on cloud compute and still could not reproduce the authors’ results on our own data. This is not an scGPT problem. It is a reality problem most R&D teams discover too late.
The Data Preprocessing Trap
The model is not the bottleneck. Data preprocessing is. Roughly 60 percent of the effort for scGPT-family models goes into the preprocessing pipeline, and most teams dramatically underestimate what “normalize the counts” means in practice.
Preprocessing for scGPT or SATURN, in practice: raw count matrices from 10x Genomics, filter out cells with fewer than 200 detected genes or a high mitochondrial read fraction, doublet detection with DoubletFinder, SCTransform for normalization, selection of the top 2,000 highly variable genes, PCA, Harmony for batch correction on the PCA embedding. Each step has parameters that matter. Each one introduces artifacts if you get it wrong.
Most teams treat this as a pipeline problem. It is not. It is a biological-interpretation problem masquerading as a data-science problem. The mitochondrial filtering threshold depends on whether you are working with immune cells (high mitochondrial activity) or fibroblasts (low). The batch-correction strategy depends on whether your batches represent technical replicates or biological conditions. Get any of these wrong and you are feeding garbage into a sophisticated foundation model.
The paper authors spent months dialing in these parameters. Their published notebooks do not show the dead ends: the run where they set the mitochondrial-read cutoff at 5 percent and lost all their macrophages, the Harmony pass that accidentally corrected out the very signal they were trying to detect. Your team does not have that institutional knowledge. You are running their code on your data and wondering why the embeddings do not separate your treatment groups.
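The macrophage dead end is easy to reproduce on paper. The sketch below is a minimal, standard-library illustration of a mitochondrial-fraction filter, not the scGPT authors' pipeline. The cell ids, gene counts, and both cutoffs are invented for the example, and a real project would more likely compute this with a QC helper such as scanpy's `calculate_qc_metrics`.

```python
# Illustrative only: cell ids, counts, and cutoffs are made up for this sketch.

def mito_fraction(counts):
    """Fraction of a cell's raw counts that come from mitochondrial genes.

    `counts` maps gene symbol -> raw count. Mitochondrial genes are assumed
    to carry the conventional "MT-" prefix (upper-cased so "mt-" also matches).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in counts.items() if gene.upper().startswith("MT-"))
    return mito / total


def surviving_cells(cells, max_mito_fraction):
    """Return ids of cells whose mitochondrial fraction is at or below the cutoff."""
    return [
        cell_id
        for cell_id, counts in cells.items()
        if mito_fraction(counts) <= max_mito_fraction
    ]


# Two toy cells: a hypothetical macrophage (9% mitochondrial reads)
# and a hypothetical fibroblast (2% mitochondrial reads).
cells = {
    "macrophage_1": {"MT-CO1": 12, "MT-ND1": 6, "CD68": 40, "LYZ": 142},
    "fibroblast_1": {"MT-CO1": 4, "COL1A1": 120, "DCN": 76},
}

strict = surviving_cells(cells, max_mito_fraction=0.05)   # drops the macrophage
lenient = surviving_cells(cells, max_mito_fraction=0.15)  # keeps both cells
```

The same code with a different cutoff silently removes an entire cell type, which is why the threshold is a biological decision rather than a default to copy from a notebook.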
The Hardware Reality Check
The published training-corpus scales are real. A typical biotech R&D team does not have eight A100 80GB GPUs sitting around. The decision point is not whether to use a foundation model. It is whether to fine-tune (cheaper than training from scratch, but limited) or run inference-only (no training cost, but inflexible).
Published numbers from large single-cell foundation-model groups put training-from-scratch in the neighborhood of 8× A100 80GB for ~72 hours, and fine-tuning on your own spatial transcriptomics data in the neighborhood of 4× A100s for ~12 hours. Most biotechs do not have that infrastructure. They have a couple of V100s in the cloud that they use for standard analysis pipelines.
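To turn those figures into a budget line, it helps to convert them to GPU-hours. The sketch below does only that arithmetic. The hourly rate and the number of repeated attempts are placeholders that I am assuming for illustration, not quoted prices or measured counts.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total GPU-hours consumed by one run."""
    return num_gpus * wall_clock_hours


def run_cost(num_gpus, wall_clock_hours, usd_per_gpu_hour, attempts=1):
    """Cost in USD of repeating the same run `attempts` times."""
    return gpu_hours(num_gpus, wall_clock_hours) * usd_per_gpu_hour * attempts


# Figures from the text: 8x A100 for ~72 h, 4x A100 for ~12 h.
pretrain_hours = gpu_hours(8, 72)   # 576 GPU-hours
finetune_hours = gpu_hours(4, 12)   # 48 GPU-hours

# Hypothetical inputs: replace with your provider's rate and your own
# estimate of how many fine-tuning attempts the project will need.
HYPOTHETICAL_USD_PER_GPU_HOUR = 4.0
HYPOTHETICAL_ATTEMPTS = 20

finetune_budget = run_cost(
    4, 12, HYPOTHETICAL_USD_PER_GPU_HOUR, attempts=HYPOTHETICAL_ATTEMPTS
)
```

A single fine-tuning run is small. The budget is driven by how many times the run is repeated while preprocessing parameters are still changing.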
You have two choices.
Fine-tune a smaller version of the model. This works if your domain is close to the training data: mouse brain tissue, human tumor microenvironments. It fails catastrophically if you are working with something exotic like engineered microbial communities or plant tissue. The model learned from mouse tissue and you are asking it to understand algae. The gap is too wide.
Use the published checkpoint as inference-only. This is free and accessible from a single V100, but you are stuck with what is encoded in the model. If the checkpoint does not know about your specific cell type or treatment condition, those embeddings will be noisy. You can get basic cell-type annotation out of it, but it will not discover novel phenotypes that were not in its training corpus.
Most teams start with inference-only, realize it is too limiting for their actual research question, then scramble to justify the GPU budget. That scrambling happens three months into a six-month project timeline.
The Regulatory Wall
The “foundation model” framing borrowed from NLP does not map cleanly when an FDA reviewer sees your validation set. In NLP, the cost of a wrong word is low and we accept that foundation models will sometimes be wrong. In diagnostics, the cost of a wrong cell classification is high enough to kill a program.
I ran into this with a client developing a companion diagnostic for a CAR-T therapy. We used scGPT to identify tumor-infiltrating-lymphocyte subpopulations that predicted response. The embeddings worked beautifully. The model identified patterns that manual gating missed. Then we got to the regulatory submission, and the FDA reviewer asked the question that determined the whole program’s economics: “How do you know this model is not hallucinating cell subtypes?”
They wanted us to validate each predicted subpopulation with orthogonal methods: flow cytometry, immunohistochemistry, possibly single-cell proteomics. That is the kind of work the foundation model was supposed to replace. The foundation model accelerates discovery; it does not eliminate the validation bottleneck. For regulated applications, it can make that bottleneck worse, because regulators do not trust black boxes.
Define what “fit for purpose” means at engagement start, not at submission.
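One way to make "fit for purpose" concrete is to write the acceptance criterion down as code before any data arrives. The sketch below is a hypothetical example, not regulatory guidance. It assumes each cell has both a model-predicted label and a label from an orthogonal assay, which in practice requires a matched measurement. The subpopulation names and the 0.9 threshold are invented.

```python
from collections import defaultdict


def per_class_agreement(model_labels, orthogonal_labels):
    """For each model-predicted class, the fraction of its cells that the
    orthogonal method assigns to the same class."""
    if len(model_labels) != len(orthogonal_labels):
        raise ValueError("label lists must be the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for predicted, reference in zip(model_labels, orthogonal_labels):
        totals[predicted] += 1
        if predicted == reference:
            hits[predicted] += 1
    return {label: hits[label] / totals[label] for label in totals}


def failing_classes(agreement, min_agreement):
    """Predicted classes whose agreement falls below the agreed threshold."""
    return sorted(label for label, value in agreement.items() if value < min_agreement)


# Hypothetical paired labels for eight cells.
model = ["CD8_exhausted"] * 4 + ["CD8_effector"] * 4
flow = ["CD8_exhausted"] * 3 + ["CD8_effector"] + ["CD8_effector"] * 4

agreement = per_class_agreement(model, flow)
needs_followup = failing_classes(agreement, min_agreement=0.9)
```

Here the exhausted population agrees on three of four cells and would be flagged for follow-up, while the effector population passes. The point is that the threshold is fixed at engagement start rather than negotiated at submission.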
The Foundation Model Is Not the Moat
The right question is not “should we use a single-cell foundation model.” It is “do we have a single-cell use case that is bigger than what a well-trained PhD with scanpy can answer in two weeks?”
Often the answer is no. Differential expression, cell-type annotation, trajectory inference: these are still faster and more interpretable with traditional methods. Foundation models earn their cost when you have massive multi-dataset meta-analysis problems, or when you are integrating modalities that do not normally play well together (spatial + single-cell + proteomics).
The threshold is roughly this: if you can solve your biological question with fewer than 50,000 cells from a single experiment, use traditional methods. If you need to integrate across five or more datasets, multiple modalities, or thousands of conditions, a foundation model starts to make sense.
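That rule of thumb can be written as a small triage function. This is a sketch of the heuristic as stated, not a validated decision procedure. The `Project` fields are hypothetical, "thousands of conditions" is read here as 1,000 or more, and anything the rule does not cover is returned as a judgment call.

```python
from dataclasses import dataclass


@dataclass
class Project:
    n_cells: int
    n_datasets: int
    n_modalities: int
    n_conditions: int


def suggest_approach(project):
    """Apply the rule of thumb from the text; cutoffs are rough, not exact."""
    # Small, single-experiment questions: stay with traditional tooling.
    if project.n_cells < 50_000 and project.n_datasets == 1:
        return "traditional methods"
    # Large integration problems: a foundation model starts to make sense.
    if (
        project.n_datasets >= 5
        or project.n_modalities > 1
        or project.n_conditions >= 1_000  # assumed reading of "thousands"
    ):
        return "foundation model"
    # Everything in between is not covered by the heuristic.
    return "judgment call"


small = suggest_approach(Project(n_cells=20_000, n_datasets=1, n_modalities=1, n_conditions=4))
atlas = suggest_approach(Project(n_cells=2_000_000, n_datasets=8, n_modalities=1, n_conditions=30))
middle = suggest_approach(Project(n_cells=200_000, n_datasets=2, n_modalities=1, n_conditions=10))
```

The function checks the small-project case first, so a small single experiment is routed to traditional methods even if it happens to include a second modality. That ordering is a choice made for this sketch.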
What separates the teams that get value from these models is not biological sophistication. It is operational discipline. Dedicated data engineers who understand both the biology and the computational requirements. Preprocessing pipelines tuned to specific cell types and conditions. GPU resources allocated before the project starts, not as an afterthought. A clear position on which regulatory regime the output will eventually face.
The foundation model is a component. The pipeline that makes it reproducible, interpretable, and fit for purpose is what delivers value. Most teams discover this after spending six months and a hundred thousand dollars. The smart ones start with the pipeline and add the model only when the scale justifies it.