Biological Molecules as Language

ESMFold2 approaches AlphaFold 3 performance but with an open, transformer-based architecture

Published

Jun 26, 2026


Google’s AlphaFold models pioneered the task of finding the shapes of biologically active molecules, opening new pathways for drug development. An open-source team refined AlphaFold 3’s architecture using insights drawn from large language models.

What’s new: A team at the non-profit biomedical research organization Biohub and the independent AI-for-biology lab EvolutionaryScale released ESMFold2, which infers the shapes of biologically active molecules (including proteins, DNA, RNA, and molecules that bind to them) by treating their components like a natural language. Like AlphaFold 3, ESMFold2 can infer molecular shapes by considering characteristics of multiple related molecules that have been aligned for comparison. Unlike AlphaFold 3, it can also use a separate transformer to embed individual molecules directly, in the manner of a large language model. In addition to ESMFold2, the team released its embedding model, which is called ESMC.

Input/output: Input is an amino-acid sequence that defines proteins, base-pair sequence that defines DNA or RNA, standardized text description (SMILES) of other biologically active molecules, or multiple sequence alignment (MSA) that aligns related amino-acid or base-pair sequences; output is molecular shape and error estimates
Architecture: Mixed (6.2 billion parameters)
Performance: Outperforms AlphaFold 3 and other competing models when inputs are not MSAs, comparable to AlphaFold 3 and other competing models when inputs are MSAs
Availability: Free via website, weights available for download via HuggingFace, API via Biohub

Key insight: To analyze a given molecule, AlphaFold 3 and similar models must also receive an MSA, which requires finding related molecules in existing databases and aligning them properly. But transformer-based large language models are good at producing embeddings based on large amounts of training data, and databases are available to provide vast numbers of sequences and standardized text descriptions of bioactive molecules. A transformer can be trained to embed individual molecules, and the embedding can serve as input instead of an MSA.
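One common way to train a transformer to embed sequences is masked-token prediction, the objective this article attributes to ESMC. The following is a minimal sketch of the data-preparation step only, not ESMC’s code: the `<mask>` token name, the 15 percent masking rate, and the example sequence are assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask token name, chosen for this sketch


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens with some replaced by MASK, {position: original residue}).
    A model trained on such pairs learns to predict the hidden residues
    from the visible ones.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    # Mask at least one position (assumes a non-empty sequence).
    n_masked = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


# Arbitrary amino-acid string used only as example input.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked_tokens, targets = mask_sequence(sequence)

# Putting the targets back recovers the original sequence.
restored = [targets.get(i, tok) for i, tok in enumerate(masked_tokens)]
print("".join(restored) == sequence)
```

During training, the model would see `masked_tokens` and be scored on how well it predicts the residues in `targets`. The hidden-layer activations it learns along the way are what serve as the embedding.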

How it works: Given a protein, DNA, or RNA sequence or description of a bioactive molecule, ESMFold2 (i) embeds the input in three ways: (a) the sequence, (b) its atoms, and (c) an MSA, if it receives one. (ii) It produces an embedding that represents the physical distances between amino acids, base pairs, or atoms in a molecule. (iii) It estimates the coordinates of the atoms in the input. And (iv) it estimates its error in those coordinates. It learned to perform these steps using two datasets that match existing sequences and descriptions to their shapes.

Given an input, the system produces three embeddings. (a) To embed amino-acid or base-pair sequences, the system uses ESMC, a transformer model. This model was trained to fill in masked tokens in roughly 2.8 billion sequences in three protein databases.
(b) A separate transformer embeds the atoms. (c) To embed MSAs, the system uses a pairmixer model that updates each element in a matrix based on the elements in the same row or column of the matrix. In this model, the matrix represents an MSA.
Given these embeddings, another pairmixer produces an embedding that represents the pairwise distances between amino acids, base pairs, and atoms. The embedding starts as pure noise. During training, it was refined by cycling through the model up to 6 times. (At inference, it cycles 10 times, as this delivered the best performance.)
Given the embedding of distances, embeddings of sequences and atoms, and a noisy point cloud of atoms, a diffusion model removes the noise to deduce the atoms’ positions.
Given the embeddings of distance, sequences, and atoms and the point cloud, a third pairmixer estimates various errors, including the error between the predicted distances and actual distances between pairs of atoms.
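The row-and-column update and the repeated cycling described above can be illustrated with a toy sketch. This is not the pairmixer itself, which uses learned layers: here each matrix element is simply blended with the mean of its row and the mean of its column, and the function names, the blending weight, and the matrix size are all assumptions made for illustration.

```python
import random


def row_col_update(pair, weight=0.5):
    """Blend each element with the mean of its row and the mean of its column.

    A stand-in for a learned update that looks only at elements sharing
    a row or column with the element being updated.
    """
    n_rows, n_cols = len(pair), len(pair[0])
    row_mean = [sum(row) / n_cols for row in pair]
    col_mean = [sum(pair[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    return [
        [(1 - weight) * pair[i][j] + weight * 0.5 * (row_mean[i] + col_mean[j])
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


def recycle(pair, n_cycles):
    """Feed the matrix back through the update n_cycles times.

    Also records how much the matrix changed in each cycle
    (sum of squared differences).
    """
    changes = []
    for _ in range(n_cycles):
        updated = row_col_update(pair)
        changes.append(sum((u - p) ** 2
                           for u_row, p_row in zip(updated, pair)
                           for u, p in zip(u_row, p_row)))
        pair = updated
    return pair, changes


# Start from pure noise, as the article describes, then cycle 10 times
# (the inference-time cycle count reported in the article).
rng = random.Random(0)
n_tokens = 5  # toy size
noise = [[rng.gauss(0.0, 1.0) for _ in range(n_tokens)] for _ in range(n_tokens)]
refined, changes = recycle(noise, n_cycles=10)
print([round(c, 4) for c in changes])
```

In this toy version, each cycle changes the matrix less than the one before, which mirrors the general idea of iterative refinement. In the real system the update is learned and is conditioned on the sequence, atom, and MSA embeddings, so the matrix moves toward a meaningful distance representation rather than a simple average.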

Results: The authors tested ESMFold2 using FoldBench, which includes tests of finding the shapes of biologically active molecules in various combinations. Given only proteins as input, ESMFold2 outperformed Chai-1, a molecular model that doesn’t accept MSAs. Given MSAs, it performed similarly to competing models that use MSAs, including AlphaFold 3.

Given a protein sequence, the team evaluated the model’s ability to deduce its shape according to the Local Distance Difference Test (lDDT), which measures similarity between estimated inter-atom distances and ground truth (higher is better). ESMFold2 achieved 0.85 lDDT, while Chai-1 achieved 0.81 lDDT.
Testing the same capability given an MSA, ESMFold2 achieved 0.89 lDDT, the same as AlphaFold 3 and Protenix-v1.
Given a protein and a DNA molecule that were bound together, the team evaluated the DockQ pass rate, a measure of the similarity between estimated inter-atom distances and ground truth at the points where the molecules touch (higher is better). ESMFold2 (80 percent) outperformed Chai-1 (71 percent). Given the same molecules plus an MSA, ESMFold2 (79 percent) matched Protenix-v1 but underperformed AlphaFold 3 (82 percent).
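To make the lDDT numbers above concrete, here is a simplified sketch of how such a score can be computed. It assumes one point per residue, pools all pairs into a single global score, and uses the metric’s usual 15-angstrom inclusion radius and 0.5, 1, 2, and 4 angstrom tolerances. It is not the evaluation code used by the authors or by FoldBench, and the toy coordinates are invented for illustration.

```python
import math
from itertools import combinations


def lddt(predicted, reference, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT over point coordinates (one point per residue).

    For every pair of points closer than `cutoff` in the reference, check
    whether the predicted distance matches the reference distance within
    each tolerance, then average the fraction preserved over all tolerances.
    """
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same number of points")
    preserved = 0
    considered = 0
    for i, j in combinations(range(len(reference)), 2):
        d_ref = math.dist(reference[i], reference[j])
        if d_ref >= cutoff:
            continue  # only local pairs count
        error = abs(math.dist(predicted[i], predicted[j]) - d_ref)
        considered += 1
        preserved += sum(error < t for t in thresholds)
    if considered == 0:
        return 0.0
    return preserved / (considered * len(thresholds))


# Toy structure: four points on a line, 3.8 angstroms apart.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Prediction with the last point displaced by 1.5 angstroms.
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (12.9, 0.0, 0.0)]

perfect_score = lddt(reference, reference)
perturbed_score = lddt(predicted, reference)
print(perfect_score, perturbed_score)
```

Because the score compares distances within each structure rather than coordinates directly, it does not require superimposing the prediction on the ground truth first. In the toy case, the three pairs that involve the displaced point pass only the 2 and 4 angstrom tolerances, so the score drops from 1.0 to 0.75.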

Behind the news: ESMFold2 is an update to Biohub’s 2022 ESMFold. It’s bigger and trained on more data. In addition, its architecture incorporates top-performing components proposed in other work, notably AlphaFold 3, such as a diffusion model that predicts atom coordinates and a model that estimates error in the system’s output.

Why it matters: Using a transformer — essentially a large language model — to embed molecules gives ESMFold2 the ability to process input molecules without requiring an aligned set of biologically related molecules. This capability reduces friction in biological research. It's especially important if a molecule is novel (such as a rapidly evolving viral protein) or synthetic (such as a product of synthetic biology) and if information about related molecules is scarce. Moreover, since the system has open weights, it’s freely available to scientists whatever their means or affiliation.

We're thinking: LLMs have proven the value of applying more processing at inference. ESMFold2’s distance-estimation model uses the same principle, improving performance by cycling its embedding through the model multiple times.
