DeepMind unveils how it solved a 50-year-old scientific challenge that could speed drug discovery
DeepMind, the London-based artificial intelligence company, has published further details of how it solved a 50-year-old scientific challenge late last year, using A.I. software to predict the shape into which proteins would fold based solely on their genetic code.
The shape of a protein is important because it helps determine that protein’s function. Most drugs work by binding to very specifically shaped “pockets” within the structure of a protein. So knowing the exact shape of the protein can be a critical step in the development of new pharmaceuticals, and DeepMind’s breakthrough has the potential to accelerate drug discovery.
The shape of a protein is usually determined using some kind of imaging method. One of the most accurate is X-ray crystallography, in which a solution of proteins is crystallized and then bombarded with high-powered X-rays, and the resulting diffraction patterns are analyzed to build up a picture of the protein. But the method is expensive, time-consuming, and sometimes fraught. More recently, other methods have been used, such as flash-freezing the proteins at extremely low temperatures and then examining them in electron microscopes.
But back in 1972, Nobel laureate chemist Christian Anfinsen postulated that it should be possible to accurately predict the exact shape a protein will fold into just by looking at its DNA sequence. At the time, however, the computational methods, the gene sequencing techniques, and just as important, the computing power, to work out such complex correlations did not exist.
A biennial contest for software that could accurately predict protein structure from genetic sequences, called the Critical Assessment of protein Structure Prediction (or CASP) competition, began in 1994. In 2018, DeepMind, which is owned by Google parent company Alphabet, entered the competition for the first time using a deep-learning system, a kind of artificial intelligence that uses neural networks: software that is loosely based on the way connections in the human brain work. DeepMind’s system, which it called AlphaFold, handily beat all the other teams, making a big leap forward in prediction accuracy, although it was still far from equaling the accuracy of X-ray crystallography.
Last year, DeepMind entered again with a redesigned deep-learning system, AlphaFold 2. This time it was able to make predictions that were so accurate across most protein types that not only did the A.I. company’s team win the contest, the CASP organizers themselves declared that DeepMind had essentially solved the protein structure prediction problem as Anfinsen had first formulated it.
Today, in a peer-reviewed paper published in the prestigious scientific journal Nature, DeepMind offered further details of how exactly its A.I. software was able to perform so well. It has also open-sourced the code it used to create AlphaFold 2 for other researchers to use.
The company has said previously that it may develop an interface that would allow academic researchers and possibly even pharmaceutical companies to simply query AlphaFold 2 for protein structure predictions, but the company has not yet announced any such access. Even with the source code, scientists outside DeepMind would still need to train the neural network themselves before they could derive useful protein structure predictions.
“We pledged to share our methods and provide broad, free access to the scientific community,” Demis Hassabis, DeepMind’s cofounder and chief executive officer, said in a statement. “Today we take the first step toward delivering on that commitment.” Hassabis promised to share more updates “soon” on the company’s progress toward making AlphaFold 2’s predictions more widely available.
In its Nature paper, DeepMind wrote that AlphaFold 2 has already helped those who study X-ray crystallography and electron microscope images of proteins to better refine their understanding of what they are seeing in that data. The system has also already proven that it can accurately predict the shape of some key proteins associated with SARS-CoV-2, the virus that causes COVID-19.
The design of the neural network used in AlphaFold 2, according to the Nature paper, is complicated. It consists of two large modules that work together to create a prediction of a protein’s structure.
The first module, which DeepMind calls Evoformer, takes in both the protein’s raw genetic sequence and data about which parts of that DNA code have co-evolved with those found in other proteins for which there is a known structure. The Evoformer then represents the data as a graph, in which the nodes of the graph are amino acids and the edges of the graph represent the proximity of pairs of amino acids to one another in the protein. This Evoformer has 48 neural network “blocks,” each of which might consist of multiple layers of the network.
Each of these blocks performs a series of manipulations of this graph, using a variety of state-of-the-art machine-learning techniques, before passing its prediction along to the next block for further revision. In this way, the entire Evoformer gradually refines a forecast for what the backbone of the protein should look like. Some of the techniques the system uses are similar to those that underpin recent breakthroughs in natural language processing.
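To make the idea of stacked refinement concrete, here is a minimal sketch in Python. It is not DeepMind’s code and does not reproduce the Evoformer. It assumes a toy graph in which each amino acid is a node and each edge holds an estimated distance between two amino acids. It also swaps the learned neural network update for a simple hand-written rule (tightening each estimate so it respects the triangle inequality). The names `refine_block` and `run_stack` are hypothetical; only the count of 48 blocks comes from the paper as described above.

```python
from typing import List

Matrix = List[List[float]]

NUM_BLOCKS = 48  # block count reported for the Evoformer


def refine_block(pair: Matrix) -> Matrix:
    """One toy 'block' (a hand-written stand-in, not a learned layer).

    Tightens every pairwise distance estimate so that it satisfies the
    triangle inequality d(i, j) <= d(i, k) + d(k, j) for all k.
    """
    n = len(pair)
    updated = [row[:] for row in pair]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                bound = pair[i][k] + pair[k][j]
                if bound < updated[i][j]:
                    updated[i][j] = bound
    return updated


def run_stack(pair: Matrix, num_blocks: int = NUM_BLOCKS) -> Matrix:
    """Pass the pairwise estimates through the blocks one after another,
    each block revising the output of the previous one."""
    for _ in range(num_blocks):
        pair = refine_block(pair)
    return pair


# Made-up starting guess for three amino acids (distances in angstroms).
# Neighbors sit 3.8 apart; the 0-to-2 guess of 20.0 is inconsistent.
initial = [
    [0.0, 3.8, 20.0],
    [3.8, 0.0, 3.8],
    [20.0, 3.8, 0.0],
]
refined = run_stack(initial)
```

In this toy case the inconsistent guess between the first and third amino acids is pulled down to the largest value the other two distances allow, while the neighbor distances are left alone. The real system learns its update rules from data rather than following a fixed rule like this one.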
The Evoformer then passes its prediction to a second module, called the Structure Prediction Module. Consisting of eight more neural network blocks, it performs a series of geometric transformations to further refine the protein’s likely shape. In particular, this module builds up a picture of the protein’s likely “side chains,” which in abstracted 3D images of proteins appear as twisty, ribbonlike curlicues that branch off from the main protein backbone.
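The article does not spell out which geometric transformations the module applies. As a rough illustration only, the Python sketch below shows one basic kind: a rigid-body move, meaning a rotation followed by a translation, applied to a handful of 3D points. The function names, the choice of rotating about the z axis, and the coordinates are all assumptions made for this example rather than details taken from AlphaFold 2.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def rotate_z(p: Point, angle: float) -> Point:
    """Rotate a point about the z axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def translate(p: Point, offset: Point) -> Point:
    """Shift a point by a fixed offset."""
    return (p[0] + offset[0], p[1] + offset[1], p[2] + offset[2])


def apply_rigid(points: List[Point], angle: float, offset: Point) -> List[Point]:
    """Apply the same rotation, then the same translation, to every point."""
    return [translate(rotate_z(p, angle), offset) for p in points]


def distance(a: Point, b: Point) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Made-up coordinates standing in for two atoms in a fragment.
fragment = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
moved = apply_rigid(fragment, math.pi / 2, (0.0, 0.0, 1.0))
```

A rigid-body move changes where a fragment sits and which way it points but leaves the distances inside it unchanged, which is why it is a natural tool for repositioning a piece of a structure without distorting it.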
DeepMind noted in its paper that while AlphaFold 2 achieved accuracy to within a fraction of an atom’s width for a majority of known protein structures, there were still some areas where it struggled. For proteins with fewer than 30 genetic sequences known to have co-evolved across proteins, AlphaFold 2’s accuracy dropped substantially. DeepMind said it thought this co-evolution information was “needed to coarsely find the correct structure in the early stages of the network.”
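For readers wondering what co-evolution information looks like in practice, the Python sketch below uses one textbook measure, mutual information between two positions in a set of aligned sequences. This is not the method described in the Nature paper, where the network learns directly from the aligned sequences. The four-sequence alignment is invented for the example, and the helper names are hypothetical. The `MIN_DEPTH` value of 30 echoes the threshold quoted above.

```python
import math
from collections import Counter
from typing import List

MIN_DEPTH = 30  # sequence count below which accuracy reportedly dropped


def column(alignment: List[str], idx: int) -> List[str]:
    """Collect the letter found at position `idx` in every aligned sequence."""
    return [seq[idx] for seq in alignment]


def mutual_information(col_a: List[str], col_b: List[str]) -> float:
    """Mutual information, in bits, between two alignment columns.

    High values mean the letters at the two positions tend to change
    together across sequences, a classic hint of co-evolution.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Invented toy alignment: positions 0 and 1 always change together,
# while position 3 varies independently of position 0.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]

linked = mutual_information(column(alignment, 0), column(alignment, 1))
unlinked = mutual_information(column(alignment, 0), column(alignment, 3))
too_shallow = len(alignment) < MIN_DEPTH
```

With only a few sequences, a measure like this is easily fooled by chance, which gives some intuition for why a prediction system that leans on co-evolution would struggle when few related sequences are available.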
The researchers also said the system did not perform as well for certain kinds of proteins whose shape is largely determined by interactions between the side chains rather than along the backbone, or that consist of two very different amino-acid chains intertwined with each other. But the scientists also wrote that “we expect” the same ideas used in AlphaFold will be able to accurately predict such complex protein bindings in the future, hinting that perhaps DeepMind has already made progress on this problem behind the scenes.