We Trained mRNA Language Models Across 25 Species for $165

Imagine going from a therapeutic protein concept to a synthesis-ready, codon-optimized DNA sequence in an afternoon. That’s what OpenMed set out to build. And they did it for $165.

This isn’t a polished corporate success story. It’s a transparent account of what worked, what surprised them, and what they’d do differently. Full results and runnable code at every step.

What They Built

The pipeline has three stages: predict the 3D structure of a protein, design amino acid sequences that fold into that structure, and optimize the underlying DNA codons so the protein actually expresses in the target organism. Structure prediction uses ESMFold from Meta; sequence design uses ProteinMPNN from the Baker Lab. The codon optimization piece is entirely theirs: new models, new training infrastructure, new evaluation metrics.

Structure prediction on 30 chains gave an average pTM (predicted TM-score) of 0.79. Sequence design on scaffold 7K00 recovered 42% of the original sequence. Those are solid numbers for established tools.

The mRNA optimization work is where the real effort went. And where the interesting results are.

Why Codon Optimization Matters

The genetic code is degenerate: most amino acids can be written with several synonymous codons, so the same protein can be encoded by astronomically many different DNA sequences. But some codon arrangements can express as much as 100x better than others. The Pfizer-BioNTech COVID vaccine was codon-optimized for human expression. Most tools rely on hand-crafted frequency tables. OpenMed wanted a model that learned preferences directly from natural coding sequences.
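To get a feel for "astronomically many", here is a minimal sketch that counts the synonymous DNA encodings of a protein under the standard genetic code. The function name `count_synonymous_encodings` is made up for illustration; it is not part of the OpenMed code.

```python
from math import prod

# Number of codons that encode each amino acid in the standard genetic code
# (one-letter amino acid codes; the 3 stop codons are not included).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_encodings(protein: str) -> int:
    """Return how many distinct coding sequences translate to `protein`."""
    return prod(CODON_DEGENERACY[aa] for aa in protein.upper())


if __name__ == "__main__":
    # A three-residue peptide already has 1 * 2 * 6 = 12 encodings.
    print(count_synonymous_encodings("MKL"))
    # The count is a product over residues, so it grows exponentially with length.
    print(count_synonymous_encodings("MKL" * 100))
```

Because the count is a product over residues, even a modest therapeutic protein has far more encodings than anyone could synthesize and test, which is why a learned model of codon preference is attractive.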

The Architecture Shootout

The core question: which transformer architecture works best for codon-level language modeling? Codons are triplets drawn from a small 64-token alphabet, with strong positional dependencies and species-specific usage biases. Not quite natural language, not quite amino acid sequences.
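As a rough illustration of what codon-level tokens look like, the sketch below splits a coding sequence into triplets and maps each to one of 64 integer ids. The id assignment and the names `CODON_TO_ID` and `tokenize_cds` are assumptions for this example; the actual OpenMed tokenizer may order codons differently and will likely add special tokens such as a mask or padding token.

```python
from itertools import product

# All 64 codons over the DNA alphabet, in a fixed (arbitrary) order.
CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}


def tokenize_cds(cds: str) -> list[int]:
    """Convert a coding sequence into a list of codon ids, one per triplet."""
    cds = cds.upper().replace("U", "T")  # accept mRNA as well as DNA input
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [CODON_TO_ID[cds[i:i + 3]] for i in range(0, len(cds), 3)]


if __name__ == "__main__":
    print(len(CODONS))             # size of the codon alphabet
    print(tokenize_cds("ATGAAA"))  # ids for the codons ATG and AAA
```

Tokenizing by codon rather than by nucleotide keeps every token aligned with one amino acid position, so the model's choice at each position is between synonymous codons.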

They started with a tiny CodonBERT baseline (6M params, following Sanofi’s published architecture) and scaled up through two families: ModernBERT (the latest efficiency innovations from NLP) and RoBERTa (the proven workhorse behind Meta’s ESM protein language models).

  • CodonBERT baseline: 6M params, BERT-tiny, just to establish floor performance
  • ModernBERT-base: 90M params, 22 layers with RoPE and efficient attention
  • CodonRoBERTa-base: 92M params, 12 layers, same family as ESM-2
  • CodonRoBERTa-large: 312M params, 24 layers, to test whether more parameters help
  • CodonRoBERTa-large-v2: 312M params, same architecture with refined hyperparameters

The choice of RoBERTa was deliberate. Meta’s ESM-2 (which powers ESMFold) is itself a RoBERTa variant trained on protein sequences. The hypothesis: the same architecture that learned amino acid patterns might also learn codon patterns.

The Results

CodonRoBERTa-large-v2 won decisively: perplexity of 4.10 and a Spearman CAI correlation of 0.40. That significantly outperformed ModernBERT. Not close.
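For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to the correct tokens. The sketch below assumes you already have, for each scored position, the probability the model gave to the true codon; the post does not say exactly which positions OpenMed scores, so treat this as the generic definition rather than their evaluation code.

```python
import math


def perplexity(true_token_probs: list[float]) -> float:
    """Perplexity from the probabilities assigned to the correct tokens."""
    if not true_token_probs:
        raise ValueError("need at least one scored position")
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that guesses uniformly over all 64 codons has perplexity 64.
    print(perplexity([1 / 64] * 10))
    # A model that always puts 25% on the right codon has perplexity 4.
    print(perplexity([0.25] * 10))
```

Read this way, a perplexity of 4.10 means the model is, on average, about as uncertain as a uniform choice among roughly four codons, compared with 64 for a model that knows nothing.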

Then they scaled to 25 species. They trained 4 production models in 55 GPU-hours. Total cost: $165. That’s absurdly cheap for what they got.
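The implied compute price is easy to back out from those figures. This is plain arithmetic on the numbers quoted above, and it assumes the full $165 is attributable to the 55 GPU-hours and that the four models cost about the same, neither of which the post spells out.

```python
total_cost_usd = 165.0   # from the post
gpu_hours = 55.0         # from the post
num_models = 4           # from the post

# Assumption: the whole bill is for these GPU-hours.
usd_per_gpu_hour = total_cost_usd / gpu_hours

# Assumption: the four production models cost roughly the same to train.
gpu_hours_per_model = gpu_hours / num_models
usd_per_model = total_cost_usd / num_models

print(f"${usd_per_gpu_hour:.2f} per GPU-hour")
print(f"{gpu_hours_per_model:.2f} GPU-hours and ${usd_per_model:.2f} per model on average")
```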

The multi-species system is species-conditioned: you tell it which organism you’re expressing in, and it optimizes codons accordingly. No other open-source project offers this.
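The post does not describe how the conditioning is implemented. One common way to condition a sequence model on a category is to prepend a dedicated token, and the sketch below shows that idea under that assumption. The species list, the token format and the `encode` function are all hypothetical, not OpenMed's API.

```python
from itertools import product

CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]

# Hypothetical subset of target organisms; the real system covers 25 species.
SPECIES = ["homo_sapiens", "escherichia_coli", "saccharomyces_cerevisiae"]

# Species tags get the first ids, codons follow.
VOCAB = {f"<{name}>": i for i, name in enumerate(SPECIES)}
VOCAB.update({codon: len(SPECIES) + i for i, codon in enumerate(CODONS)})


def encode(species: str, cds: str) -> list[int]:
    """Encode a coding sequence with a leading species token."""
    tag = f"<{species}>"
    if tag not in VOCAB:
        raise KeyError(f"unknown species: {species}")
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [VOCAB[tag]] + [VOCAB[codon] for codon in codons]


if __name__ == "__main__":
    # Same codons, different leading token depending on the host organism.
    print(encode("homo_sapiens", "ATGAAA"))
    print(encode("escherichia_coli", "ATGAAA"))
```

With this kind of setup a single model can serve every organism: the leading token tells it whose codon preferences to apply, and the rest of the input is unchanged.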

Where This Stands

These results are better than I expected for a $165 experiment. The perplexity numbers are genuinely good. The CAI correlation of 0.40 means the model is learning real codon usage biases, not just memorizing patterns.
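For context, CAI (Codon Adaptation Index) scores a sequence by how closely its codon choices match a reference set of genes: each codon gets a weight equal to its frequency divided by the frequency of the most common synonymous codon, and CAI is the geometric mean of those weights. Below is a minimal sketch of that textbook definition. The 0.5 floor for unseen codons and the function names are choices made for this example, and the post does not say how OpenMed pairs model scores with CAI to get the Spearman figure.

```python
from collections import Counter
from math import exp, log

# Standard genetic code, built from the usual TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def relative_adaptiveness(reference_cds: list[str]) -> dict[str, float]:
    """Weight of each codon relative to its most used synonym in the reference."""
    counts = Counter(
        seq[i:i + 3] for seq in reference_cds for i in range(0, len(seq) - 2, 3)
    )
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa in "*MW":  # stops and single-codon amino acids carry no signal
            continue
        best = max(counts[c] for c, a in CODON_TABLE.items() if a == aa)
        if best > 0:
            # Floor unseen codons at 0.5 so the log stays finite (a choice made here).
            weights[codon] = max(counts[codon], 0.5) / best
    return weights


def cai(cds: str, weights: dict[str, float]) -> float:
    """Geometric mean of codon weights over the scorable codons of `cds`."""
    logs = [
        log(weights[cds[i:i + 3]])
        for i in range(0, len(cds) - 2, 3)
        if cds[i:i + 3] in weights
    ]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return exp(sum(logs) / len(logs))


if __name__ == "__main__":
    # Toy reference in which leucine is encoded by CTG three times and CTA once.
    w = relative_adaptiveness(["CTGCTGCTGCTA"])
    print(cai("CTGCTG", w))  # uses only the preferred codon
    print(cai("CTACTA", w))  # uses only the rarer codon
```

A rank correlation between a model's sequence scores and CAI then asks whether the model orders sequences the same way this frequency-based index does.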

But let’s be honest: this is a first pass. The training data was 381k coding sequences across 25 species. That’s not nothing, but it’s also not the full diversity of the tree of life. The model hasn’t been validated in wet-lab experiments. Perplexity and CAI correlation are proxies, not guarantees of actual expression levels.

Still, for an open-source project that cost less than a nice dinner for two, this is impressive. The pipeline is reproducible. The code is available. Anyone can take this and run with it.

What’s Next

OpenMed plans to expand to more species, test with actual expression data, and release the models on Hugging Face. If they can maintain this pace, they’ll have something genuinely useful for therapeutic mRNA design within months.

I’ll be watching this one. The combination of low cost, open-source ethos, and solid engineering is exactly what this field needs.
