Two molecules can contain the same types and numbers of atoms but exhibit distinct properties because their shapes differ. New research improves machine learning representations to distinguish such molecules.
What’s new: Xiaomin Fang, Lihang Liu, and colleagues at Baidu proposed geometry-enhanced molecular representation learning (GEM), an architecture and training method that classifies molecules and estimates their properties.
Key insight: Chemists have used graph neural networks (GNNs) to analyze molecules based on their atomic ingredients and the types of bonds between the atoms. However, these models weren’t trained on structural information, which plays a key role in determining a molecule’s behavior. They can be improved by training on structural features such as the distances between atoms and angles formed by their bonds.
GNN basics: A GNN processes datasets in the form of graphs, which consist of nodes connected by edges. For example, a graph might depict customers and products as nodes and purchases as edges. This work used a vanilla neural network to update the representation of each node based on the representations of neighboring nodes and edges.
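The update rule described above can be sketched in a few lines. This is a minimal toy example, not the authors' implementation: the graph, features, and the single dense layer standing in for the "vanilla" network are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 8

# Toy graph: a 4-cycle. Node and edge features are random placeholders.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
node_feats = rng.normal(size=(num_nodes, dim))
edge_feats = {e: rng.normal(size=dim) for e in edges}
edge_feats.update({(j, i): f for (i, j), f in list(edge_feats.items())})

W = rng.normal(size=(2 * dim, dim)) * 0.1  # weights of the "vanilla" network

def message_passing_step(h):
    """Return updated node representations after one round of message passing."""
    h_new = np.zeros_like(h)
    for v in range(num_nodes):
        neighbors = [u for (u, w) in edge_feats if w == v]
        # Each message concatenates a neighbor's representation with the edge's.
        msgs = [np.concatenate([h[u], edge_feats[(u, v)]]) for u in neighbors]
        h_new[v] = np.maximum(np.mean(msgs, axis=0) @ W, 0.0)  # aggregate, then ReLU
    return h_new

updated = message_passing_step(node_feats)
print(updated.shape)  # (4, 8)
```

Stacking several such steps lets information propagate between nodes that are several edges apart.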
How it works: The authors trained a modified GNN to estimate structural attributes of 18 million molecules whose properties were unlabeled, then fine-tuned it to predict molecular properties.
- The model processed two graphs in sequence: a bond-angle graph, in which nodes were bonds and edges were bond angles, and an atom-bond graph, in which nodes were atoms and edges were the bonds between them.
- First, it updated the representation of each bond in the bond-angle graph. Then it used the learned bond representations as edge features in the atom-bond graph and updated the representation of each atom there.
- Using these representations, separate vanilla neural networks learned to estimate bond lengths, bond angles, distances between each pair of atoms in the molecule, and molecular fingerprints (bit-strings that encode which atoms are connected).
- The authors fine-tuned the system on 15 tasks in a benchmark of molecular properties, such as classifying toxicity and estimating properties related to water solubility.
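The two-stage pass described above can be sketched on a toy molecule. Everything here is illustrative, assuming random features, a single dense layer per update, and a linear head for one of the self-supervised targets; the actual GEM architecture is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Toy molecule: 3 atoms in a chain, giving 2 bonds and 1 angle between them.
bonds = [(0, 1), (1, 2)]   # atom-bond graph: nodes are atoms, edges are bonds
angles = [(0, 1)]          # bond-angle graph: nodes are bonds, edges are angles

atom_h = rng.normal(size=(3, dim))
bond_h = rng.normal(size=(2, dim))
angle_h = rng.normal(size=(1, dim))

W_bond = rng.normal(size=(2 * dim, dim)) * 0.1
W_atom = rng.normal(size=(2 * dim, dim)) * 0.1
w_len = rng.normal(size=dim) * 0.1  # head that regresses bond length

# Stage 1: update bond representations in the bond-angle graph.
for k, (b1, b2) in enumerate(angles):
    bond_h[b1] = np.maximum(np.concatenate([bond_h[b2], angle_h[k]]) @ W_bond, 0.0)
    bond_h[b2] = np.maximum(np.concatenate([bond_h[b1], angle_h[k]]) @ W_bond, 0.0)

# Stage 2: reuse the learned bond representations as edge features in the
# atom-bond graph and update each atom from its bonded neighbor.
for b, (i, j) in enumerate(bonds):
    atom_h[i] = np.maximum(np.concatenate([atom_h[j], bond_h[b]]) @ W_atom, 0.0)
    atom_h[j] = np.maximum(np.concatenate([atom_h[i], bond_h[b]]) @ W_atom, 0.0)

# Self-supervised head: predict each bond's length from its representation.
pred_lengths = bond_h @ w_len
print(pred_lengths.shape)  # (2,)
```

In pretraining, losses on targets like these predicted lengths (and angles, distances, and fingerprints) supply the geometric signal; fine-tuning swaps in property-prediction heads.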
Results: GEM achieved state-of-the-art results on 14 tasks, surpassing GROVER, a transformer-GNN hybrid that learns to classify a molecule’s connected atoms and bond types but not structural attributes. For example, when estimating properties that are important for solubility in water, it achieved 1.9 root mean squared error (RMSE), while the large version of GROVER achieved 2.3. On average, GEM outperformed GROVER by 8.8 percent on regression tasks and 4.7 percent on classification tasks.
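For readers unfamiliar with the metric quoted above, root mean squared error is straightforward to compute (a lower value means more accurate predictions):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # ≈ 1.1547
```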
Why it matters: This work enabled a GNN to apply representations it learned from one graph to another — a promising approach for tasks that involve overlapping but distinct inputs.
We’re thinking: How can you trust information about atoms? They make up everything!