What AI Knows About Proteins NLP systems can be used to code amino acids.

Published

Jun 02, 2021

Reading time

2 min read

Transformer models trained on sequences of amino acids that form proteins have had success classifying and generating viable sequences. New research shows that they also capture information about protein structure.

What’s new: Transformers can encode the grammar of amino acids in a sequence the same way they do the grammar of words in a language. Jesse Vig and colleagues at Salesforce Research and University of Illinois at Urbana-Champaign developed methods to interpret such models that reveal biologically relevant properties.

Key insight: When amino acids bind to one another, the sequence folds into a shape that determines the resulting protein’s biological functions. In a transformer trained on such sequences, a high self-attention value between two amino acids can indicate that they play a significant role in the protein’s structure. For instance, the protein’s folds may bring them into contact.

How it works: The authors studied a BERT pretrained on a database of amino acid sequences to predict masked amino acids based on others in the sequence. Given a sequence, they studied the self-attention values in each layer of the model.

For each sequence in the dataset, the authors filtered out self-attention values below a threshold to find amino acid pairs with strong relationships. Consulting information in the database, they tallied the number of relationships associated with a given property of the protein’s shape (for example, pairs of amino acids in contact).
Some properties depended on only one amino acid in a pair. For example, an amino acid may be part of the protein site that binds to molecules such as drugs. (The authors counted such relationships if the second amino acid had the property in question.)

Results: The authors compared their model’s findings with those reported in other protein databases. The deeper layers of the model showed an increasing proportion of related pairs in which the amino acids actually were in contact, up to 44.7 percent, while the proportion of all amino acids in contact was 1.3 percent. The chance that the second amino acid in a related pair was part of a binding site didn’t rise steadily across layers, but it reached 48.2 percent, compared to a 4.8 percent chance that any amino acid was part of a binding site.

Why it matters: A transformer model trained only to predict missing amino acids in a sequence learned important things about how amino acids form a larger structure. Interpreting self-attention values reveals not only how a model works but also how nature works.

We’re thinking: Such tools might provide insight into the structure of viral proteins, helping biologists discover ways to fight viruses including SARS-CoV-2 more effectively.

Subscribe to The Batch