DeepMind publishes A.I.-predicted structures for most human proteins in a giant leap for biology

DeepMind, the London-based artificial intelligence company, has used A.I. software to predict the molecular structure of hundreds of thousands of proteins, including almost all proteins in the human body, and published the findings in a new, freely accessible public database.

Until now, only 17% of human proteins had structures that had been definitively determined by experimental methods. DeepMind’s A.I. system, which it calls AlphaFold, can take the genetic sequence for a protein and, from that, directly predict its structure. AlphaFold had “very high confidence” in 36% of the new human protein predictions and some level of confidence in 58% of them, DeepMind researchers said in a paper published in the scientific journal Nature Thursday.

Scientists hailed the new database as a stunning advance with a potential impact in biology perhaps second only to the publication of the human genome twenty years ago. Elizabeth Blackburn, a Nobel Prize-winning molecular biologist, called DeepMind’s approach “revolutionary” and said that the new database “will open new windows for the scientific community onto the biological meaning of the genome sequence.”

Researchers said the new database could speed up, potentially by years, critical areas of biological research, including the search for new drugs. In early work, DeepMind has collaborated with scientists searching for medicines to address two deadly tropical diseases, Chagas disease and leishmaniasis. It has also published the structures of many proteins associated with SARS-CoV-2, the virus that causes COVID-19. But the new database is likely to have an impact across much of biology, and business. For instance, DeepMind has collaborated with researchers seeking to develop new enzymes that can digest plastic.

Demis Hassabis, DeepMind’s co-founder and chief executive officer, said that the publication of AlphaFold’s structure predictions marked his company’s “biggest contribution to science to date” and was “an example of the benefits A.I. can bring to society.”

From mice to malaria

DeepMind, which is owned by Google parent Alphabet, is also making freely available predicted structures for all of the proteins found in 20 other organisms of interest to biologists, ranging from mice to the malaria parasite.

The new database will be maintained by the European Bioinformatics Institute, an international scientific body in Cambridge, England, that is part of the European Molecular Biology Laboratory. Researchers will be able to access it through a website.

DeepMind also said it intends to continue to add more protein structure predictions to the new database until it covers all 130 million genetic sequences that are known to encode proteins from any organism and are part of a large genomics library called UniRef90. That would be 700 times the number of structures currently stored in the Protein Data Bank, which contains all the proteins for which geometry has been experimentally verified by a process called X-ray crystallography or by electron microscopes.

While that sounds like a monumental task, DeepMind might be able to complete it in about two years. It took AlphaFold only about 48 hours to produce the 350,000 protein predictions DeepMind is initially publishing to the new database, according to Kathryn Tunyasuvunakool, a researcher who worked on DeepMind’s AlphaFold team.
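As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not DeepMind’s own estimate: it assumes the 48-hour pace of the initial release would hold steady for the whole job, and the function and variable names are made up for illustration. The input numbers are the ones reported above.

```python
# Figures as reported in the article.
TARGET_SEQUENCES = 130_000_000   # sequences DeepMind intends to cover
INITIAL_PREDICTIONS = 350_000    # predictions in the initial release
INITIAL_HOURS = 48               # reported time to produce that release
PDB_MULTIPLE = 700               # "700 times" the Protein Data Bank


def implied_pdb_entries(target: int, multiple: int) -> float:
    """Size of the Protein Data Bank implied by the 700-fold comparison."""
    return target / multiple


def hours_at_constant_rate(target: int, batch: int, batch_hours: float) -> float:
    """Hours needed for `target` predictions, ASSUMING the initial pace never changes."""
    return target / batch * batch_hours


pdb_entries = implied_pdb_entries(TARGET_SEQUENCES, PDB_MULTIPLE)
total_hours = hours_at_constant_rate(TARGET_SEQUENCES, INITIAL_PREDICTIONS, INITIAL_HOURS)
total_years = total_hours / 24 / 365

print(f"Implied Protein Data Bank size: about {pdb_entries:,.0f} structures")
print(f"Time at a constant rate: about {total_years:.1f} years")
```

Under that constant-rate assumption the job comes out at roughly two years, in line with the estimate above, and the 700-fold comparison implies a Protein Data Bank of somewhat under 200,000 structures.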

Paul Nurse, a Nobel laureate geneticist who is chief executive officer of the Francis Crick Institute, said that the new database could open up what he called “a systems approach” to protein research, where researchers compare the structures of entire classes of proteins across organisms, and across the entire genome of organisms to understand how different biological functions have evolved. He said this has not been possible previously because there weren’t enough known protein structures. “We do not yet really have our head around it to be honest,” Nurse said. “The ability to look at a wide number of potential structures involved in biological processes at the same time opens up a new way of thinking that is not fully exploited yet.”

“A new place to stand”

Ewan Birney, the deputy director general of EMBL, also waxed lyrical about the possibilities of the new database. “It’s a sense that a new vista, a new place to stand on, has opened up and now many scientists need to come to that platform and look out at that new landscape and see where to go,” he said.

Proteins are the building blocks of all life. They are formed from long chains of amino acids produced and assembled inside cells according to genetic instructions found in DNA. Although the chains are produced in a linear way, proteins fold spontaneously into highly complex three-dimensional shapes. This folding pattern is determined by the laws of physics, but there are so many possible folds it is impossible, even with a supercomputer, to arrive at the correct structure in any reasonable time period by simply trying combinations.
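To get a feel for the scale, here is a minimal sketch in Python of the brute-force count. The inputs are illustrative assumptions, not figures from DeepMind or the researchers quoted here: a chain of 100 amino acids, three possible shapes per amino acid, and a hypothetical computer that checks a quadrillion shapes per second.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Illustrative assumptions, not measured values.
CHAIN_LENGTH = 100           # amino acids in a hypothetical small protein
SHAPES_PER_RESIDUE = 3       # assumed conformations available to each amino acid
CHECKS_PER_SECOND = 10**15   # hypothetical machine testing a quadrillion shapes a second


def count_conformations(chain_length: int, shapes_per_residue: int) -> int:
    """Number of distinct folds if every amino acid varies independently."""
    return shapes_per_residue ** chain_length


def years_to_enumerate(conformations: int, checks_per_second: int) -> float:
    """Years needed to try every fold once at the given checking rate."""
    return conformations / checks_per_second / SECONDS_PER_YEAR


total_shapes = count_conformations(CHAIN_LENGTH, SHAPES_PER_RESIDUE)
search_years = years_to_enumerate(total_shapes, CHECKS_PER_SECOND)

print(f"Possible shapes: about {total_shapes:.2e}")
print(f"Years to try them all: about {search_years:.2e}")
```

Even with those generous assumptions, the count runs to more than 10^47 shapes, and trying them all would take vastly longer than the age of the universe.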

As a result, scientists have had to rely on tricky, time-consuming and expensive experimental methods to build up a picture of a protein’s structure. The gold standard for doing so is X-ray crystallography, where a solution of proteins has to be turned into a solid crystal that is then bombarded with high-powered X-rays. The diffraction patterns made by the X-rays are then analyzed to try to discern the protein’s shape.

In 1972, Nobel laureate chemist Christian Anfinsen postulated that it ought to be possible to go directly from a protein’s DNA sequence to its structure. But the scientific and computational methods for finding correlations between known structures and genetic sequences did not exist at the time. In the early 1990s, researchers began hosting a biennial competition for software that could solve this “protein folding problem.” In 2018, DeepMind entered that contest and outperformed all other teams, but its predictions were still mostly not as accurate as X-ray crystallography. This past year, DeepMind entered again with an updated version of AlphaFold. This time, it not only dominated all other research teams but also performed so accurately that the contest organizers declared the “protein folding problem” effectively solved.

At the time the contest results were announced in November, DeepMind promised to make AlphaFold and its structure predictions available to the wider scientific community. Hassabis told Fortune that the company had initially considered simply creating an interface that would let researchers submit DNA sequences they were interested in to AlphaFold and then get a structure prediction back. But the company later decided that the way to have the broadest impact would be to proactively run vast numbers of gene sequences through the A.I. system itself and publish the results in a database.

This was enabled by enhancing AlphaFold’s ability to make complicated calculations more quickly, reducing the time the system took to make each structure prediction, Hassabis said. When the system could calculate a structure in just a few minutes for the average protein, and sometimes even faster, he said, DeepMind suddenly realized it would actually be possible to catalogue and publish a structure prediction for every known protein-encoding DNA sequence, especially given the vast computing power DeepMind has readily available as part of Alphabet.

Plug and play

DeepMind published another Nature paper last week detailing the machine learning methods it used to build AlphaFold. It also open-sourced AlphaFold’s software code, meaning anyone with some basic programming skills can download and run a version of it. Hassabis said this would probably be most useful for researchers who design synthetic proteins—since those are not found in nature and would not be among those initially added to the new database.

In making both AlphaFold’s code and a vast database of its protein structure predictions freely available, the company is leaving incalculable economic value on the table. But Hassabis said the company “didn’t see it that way.”

“We are first and foremost scientists,” he said. “Our mission is to build A.I., and use that to advance science and humanity.”

Nurse said that because DeepMind had used publicly available genetic and protein databases to train AlphaFold, it was only appropriate that it was making the system’s predictions freely available to the whole scientific community.

DeepMind, which was founded in 2010 and acquired by Google four years later for a reported $600 million, is primarily a research organization. Apart from AlphaFold, it is best known for having created A.I. software that could beat the world’s best human players at the ancient strategy game Go. It loses hundreds of millions of dollars a year, costs that its parent, Alphabet, absorbs. In exchange, DeepMind supplies Google with fundamental machine learning breakthroughs that get incorporated into Google products, including the Google digital assistant and its Android operating system. But Google’s payments for these innovations do not cover DeepMind’s operating costs, according to financial records publicly available at the U.K.’s business registry, Companies House.

Hassabis told Fortune that speculation in the press that DeepMind is under increasing pressure from Alphabet to prove its commercial value is untrue. But in a press conference ahead of today’s announcement, Hassabis did not directly answer a question about whether Google’s new health division or DeepMind’s Alphabet sister companies, Verily and Calico, both of which are focused on life sciences, had been given early access to AlphaFold’s structure predictions. “This is all confidential information, but we’ve worked with many, many beta partners, I guess we call them collaborators, to get as much information back, early information back, about what would be useful for the biological community and the research community,” he said.

Hassabis said in an interview that DeepMind had not ruled out working with pharmaceutical companies and biotechnology businesses on other aspects of drug discovery or innovations involving protein-folding in the future.

Tackling a tropical disease

The Drugs for Neglected Diseases Initiative (DNDI), a non-profit drug development organization that tackles diseases prevalent in the developing world, is one of DeepMind’s AlphaFold beta partners. In one case, the initiative’s researchers had a molecule that they knew from experiments killed the parasite that causes leishmaniasis, a disease endemic in many tropical regions that kills as many as 40,000 people each year. They wanted to understand how the drug worked and whether they could improve it, said Ben Perry, DNDI’s discovery and open innovation lead. But the protein that they thought the drug targeted was impossible to image with laboratory methods. “That was a roadblock until AlphaFold came along,” he said. “Now we have a structure from AlphaFold and we can effectively start an entirely new drug discovery project based on that.”

Researchers said that AlphaFold’s new database would not necessarily eliminate the need to do lab experiments to verify proteins’ structures. But Tunyasuvunakool said there had already been cases where AlphaFold’s protein-folding predictions, including those in which the system had only some confidence, had enabled experimental scientists to work out structures that had eluded them for years. “You send out a prediction, and you can sometimes get an e-mail back from the researcher within, you know, less than 24 hours,” she said. “I literally get emails with titles like success, all caps, exclamation point, exclamation point, exclamation point. So I think there’s a lot of utility to be had from these structures and not only from the ones in the very highest confidence bracket.”

While AlphaFold’s predictions are as accurate as, or more accurate than, laboratory methods for determining structures for some proteins (including proteins found on the membranes of cells, which have historically been difficult to assess experimentally), its accuracy drops off substantially for others. John Jumper, the DeepMind senior research scientist who led the AlphaFold team, said that it seems that in many cases where AlphaFold struggles it is because the protein itself is “disordered,” with no inherent structure in isolation. Many of these proteins are dynamic, with their structure changing depending on interactions with other proteins or substances in the body. Scientists estimate that about 40% of human proteins are disordered. These proteins play a critical role in many biological processes and diseases, but are difficult to target with drugs because it is hard to figure out how to configure a molecule to bind with them.

Jumper noted that the 40% estimate corresponded closely to the 42% of proteins where AlphaFold’s confidence in the accuracy of its own predictions is lower. He said that while further research was necessary to be sure, it could be that AlphaFold’s confidence score will prove to be a good tool for giving scientists a strong indication that a protein is inherently disordered. That alone, he said, would be useful to researchers, with those looking for drug targets knowing, for instance, not to waste time on that protein.
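The screening idea Jumper describes can be sketched in Python. This is only an illustration of the concept, not DeepMind’s method: the protein names and scores are invented, the 0-to-100 scale is assumed, and the cutoff of 70 is a hypothetical threshold chosen for the example.

```python
from typing import Dict, List

# Hypothetical cutoff on an assumed 0-100 confidence scale; not taken from the article.
LOW_CONFIDENCE_CUTOFF = 70.0


def flag_possibly_disordered(
    scores: Dict[str, float], cutoff: float = LOW_CONFIDENCE_CUTOFF
) -> List[str]:
    """Return, in sorted order, the proteins whose confidence is below the cutoff."""
    return sorted(name for name, score in scores.items() if score < cutoff)


def low_confidence_share(
    scores: Dict[str, float], cutoff: float = LOW_CONFIDENCE_CUTOFF
) -> float:
    """Fraction of proteins flagged as low confidence (0.0 for an empty input)."""
    if not scores:
        return 0.0
    return len(flag_possibly_disordered(scores, cutoff)) / len(scores)


# Invented example data, for illustration only.
example_scores = {
    "protein_A": 92.5,
    "protein_B": 48.0,
    "protein_C": 71.3,
    "protein_D": 35.9,
    "protein_E": 88.1,
}

flagged = flag_possibly_disordered(example_scores)
share = low_confidence_share(example_scores)

print("Candidates for disorder:", flagged)
print(f"Share flagged: {share:.0%}")
```

A researcher hunting for drug targets might use a list like this to set the flagged proteins aside, with the caveat Jumper himself raised: whether low confidence reliably signals disorder still needs further research.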

Jumper and Hassabis both said that DeepMind would continue to work on protein-folding, including seeing whether AlphaFold, or a system similar to it, can be used to predict how multiple proteins will interact and possibly bind with one another.

Pushmeet Kohli, who leads DeepMind’s efforts to apply A.I. to scientific questions, said that AlphaFold had taught the company valuable lessons in how to embed scientific knowledge into machine learning systems and that these lessons were already being applied in DeepMind’s research on topics like controlling nuclear fusion and astrophysics.

Correction, July 22: A previous version of this story misspelled the last name of EMBL deputy director general Ewan Birney.
