Running genomics pipelines when your cloud budget is zero
Most genomics infrastructure tutorials assume you have AWS credits or a university HPC account. I don’t. Here’s what actually works with zero budget.
Local compute first
A decent workstation gets you further than you'd expect. Multi-threaded tools like fastp, bwa-mem2, and samtools scale well with core count. If you're disciplined about cleaning up intermediate files and store alignments as CRAM rather than BAM, storage stays manageable.
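One way to keep intermediates off disk entirely is to pipe the aligner straight into a sorted CRAM. A minimal sketch, assuming bwa-mem2 and samtools are on PATH, the reference has a bwa-mem2 index and a .fai, and all file paths are placeholders:

```python
import shutil
import subprocess

def align_to_cram(r1, r2, ref, out_cram, threads=8):
    """Pipe bwa-mem2 output straight into samtools sort, writing sorted CRAM.

    No intermediate SAM/BAM ever touches the disk. Assumes bwa-mem2 and
    samtools are installed and `ref` has both a bwa-mem2 index and a .fai.
    """
    for tool in ("bwa-mem2", "samtools"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    aligner = subprocess.Popen(
        ["bwa-mem2", "mem", "-t", str(threads), ref, r1, r2],
        stdout=subprocess.PIPE,
    )
    # samtools sort reads SAM from stdin and emits sorted CRAM directly
    subprocess.run(
        ["samtools", "sort", "-@", str(threads), "-O", "cram",
         "--reference", ref, "-o", out_cram],
        stdin=aligner.stdout, check=True,
    )
    aligner.stdout.close()
    if aligner.wait() != 0:
        raise RuntimeError("bwa-mem2 exited nonzero")
```

Whether you wrap this in Python or a plain shell one-liner doesn't matter much; the point is that nothing between aligner and sorted CRAM is ever written out.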
Where free cloud helps
- Google Colab — usable for short GPU jobs (ESM2 inference, small model training). Unreliable for anything long-running, but fine for prototyping.
- Galaxy Project — for standard variant calling and RNA-seq workflows. Underrated. You’re not going to publish a custom pipeline from here, but it’s good for sanity checks.
- SRA Toolkit + public datasets — if you're building a pipeline, test it on public data before asking for institutional compute time.
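For a smoke test you rarely need a whole run: fastq-dump's `-X` flag caps the number of spots downloaded. A small helper sketch (the accession in the test is just a placeholder; the SRA Toolkit must be installed to actually execute the command):

```python
def subset_cmd(accession: str, n_spots: int = 10_000) -> list[str]:
    """Build a fastq-dump command that pulls only the first n_spots reads.

    -X caps the number of spots, so a pipeline smoke test downloads
    megabytes instead of the whole run. Returned as an argv list so you
    can log it, dry-run it, or hand it to subprocess.run.
    """
    return ["fastq-dump", "-X", str(n_spots), "--split-files", accession]
```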
The actual constraint
It’s not compute, it’s storage. Keeping raw FASTQs around is expensive. Build your pipeline to be reproducible from SRA accessions so you can delete and re-download rather than store.
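The delete-and-re-download pattern can be sketched as a fetch function keyed by accession: if the FASTQs exist, reuse them; if not, regenerate them from the public archive. A sketch assuming the SRA Toolkit (prefetch, fasterq-dump) is on PATH when a download is actually needed:

```python
from pathlib import Path
import subprocess

def fetch_fastqs(accession: str, outdir: str = "fastq") -> list[Path]:
    """Return FASTQ paths for an SRA accession, downloading only when absent.

    Because everything is keyed by accession, the files can be deleted at
    any point and this call will regenerate them. Assumes the SRA Toolkit
    (prefetch, fasterq-dump) is installed when a download is needed.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    existing = sorted(out.glob(f"{accession}*.fastq"))
    if existing:  # already on disk: reuse instead of re-downloading
        return existing
    subprocess.run(["prefetch", accession], check=True)
    subprocess.run(
        ["fasterq-dump", accession, "--split-files", "--outdir", str(out)],
        check=True,
    )
    return sorted(out.glob(f"{accession}*.fastq"))
```

Downstream steps call fetch_fastqs instead of hard-coding file paths, so "delete the raw data" stops being a scary operation.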
Not ideal, but workable. The constraint forces you to be more deliberate about what you’re actually running.