One of the biggest and hardest problems in biology, which has huge potential payoffs for human welfare, is how to figure out what shape a protein has from the sequence of its constituent amino acids. As you probably know, a lot of DNA codes for proteins (about 20,000 proteins in our own genome), each protein being a string of amino acids, sometimes connected to other molecules like sugars or hemes. The amino acid sequence is determined by the DNA sequence, in which each group of three nucleotide bases in the “structural” part of the DNA sequence codes for a single amino acid. The DNA is transcribed into messenger RNA, which goes into the cytoplasm where, attached to structures called ribosomes and with the help of enzymes, the RNA sequence is translated into proteins, which can be hundreds of amino acids long.
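For readers who like to see the three-bases-per-amino-acid rule in action, here is a minimal Python sketch. It is only an illustration: the function name `translate` and the test sequence are made up for this example, the table holds just six entries of the standard genetic code, and it reads the DNA coding strand directly rather than the messenger RNA copy (which carries U in place of T).

```python
# Partial codon table from the standard genetic code.
# Only the codons used in the example below are included (an assumption
# made to keep the sketch short; the full table has 64 entries).
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop signal
}


def translate(dna):
    """Read a coding DNA string three bases at a time; return one-letter amino acids."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # ignore any incomplete final codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        residue = CODON_TABLE[codon]  # raises KeyError for codons not in this partial table
        if residue == "*":
            break  # a stop codon ends the protein
        protein.append(residue)
    return "".join(protein)


# Hypothetical 18-base sequence: five codons followed by a stop codon.
print(translate("ATGTTTGGCAAATGGTAA"))  # prints MFGKW
```

Eighteen bases give five amino acids plus a stop signal, which is why a protein hundreds of amino acids long needs a coding sequence roughly three times that length.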
In nearly every case (see below for one exception), the sequence of amino acids itself determines the shape of the resultant protein, for the laws of physics determine how a protein will fold up as its constituent bits attract or repel each other. The shape can involve helixes, flat sheets, and all manner of odd twists and turns. Here’s one protein, PDB 6C7C: Enoyl-CoA hydratase, an enzyme from a bacterium that causes human skin ulcers. This isn’t a very complex shape, but may be important in studying how a related bacterium causes tuberculosis, as well as designing drugs against those skin ulcers:
And here’s human hemoglobin, formed by the agglomeration of four protein chains, two copies each from two genes (from Wikipedia):
Knowing protein shape is useful for many reasons, including ones related to health. Drugs, for example, can be designed to bind to and knock out target proteins, but it’s much easier to design a drug if you know the protein’s shape. (We know the shape of only about a quarter of our 20,000 proteins.) Knowing a protein’s shape can also reveal how a pathogen causes disease, such as how the “spike protein” of the COVID-19 virus latches onto human cells (this helped in the development of vaccines). Here’s the viral spike protein, with one receptor binding domain depicted as ribbons:
And there are many questions, both physiological and evolutionary, that hinge on knowing protein shapes. When one protein evolves into a different one, how much does its shape change, and can that change in shape explain a change of function? (Remember, under Darwinian evolution, gradual changes of sequence must be continually adaptive.) How do different shapes of odorants interact with the olfactory receptor proteins, giving a largely one-to-one relationship between protein shape and odor molecules?
Until now, determining protein shape was one of the most tedious and onerous tasks in biology. It started decades ago with X-ray crystallography, in which a protein had to be crystallized and then bombarded with X-rays, with the pattern of scattered X-rays having to be laboriously interpreted and back-calculated into estimates of shape. (This is how the shape of DNA was determined by Franklin and Wilkins.) This often took years for a single protein. There are other ways, too, including nuclear magnetic resonance, and new methods like cryogenic electron microscopy, but these too are painstakingly slow.
Now, as the result of a competition in which different scientific teams are asked to use computer programs to predict the structure of proteins that are already known but not published, one team, DeepMind from Google, has achieved astounding predictive success using artificial intelligence (AI), to the point where other technologies to determine protein structure may eventually become obsolete.
There are two articles below, but dozens on the Internet. The first one below, from Nature, is comprehensive (click on screenshot to read both):
This article, from the DeepMind blog itself (click on screenshot), is shorter but has a lot of useful information, as well as a visual that shows how closely their AI program predicted protein structure.
In a yearly contest called CASP (Critical Assessment of Structure Prediction), a hundred competing teams were asked to guess the three-dimensional structure of about a hundred sections of proteins (“domains”). The 3D structures of these domains were already known to those who had worked them out, but were unknown to the competing teams, as the structures hadn’t been published.
The method behind DeepMind’s AI program is above my pay grade, but it involved “training” the “AlphaFold” program to predict protein structures by feeding it the amino-acid sequences of proteins whose 3D structure was already known. In the contest a couple of years ago, they began by training the program to predict the distance between any pair of amino acids in a protein (if you know the distances between all pairs of amino acids, you have the 3D structure). This year they used a more sophisticated program, called AlphaFold2, that, according to the Nature article, “incorporate[s] additional information about the physical and geometric constraints that determine how a protein folds.” (I have no idea what these constraints are; the procedure hasn’t yet been published but will be early next year.)
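The parenthetical point about pairwise distances can be made concrete with a toy example. This sketch is not DeepMind's code and has nothing to do with how AlphaFold is actually built. It simply shows, using made-up coordinates and hypothetical helper names (`distance_matrix`, `rotate_z`), that the table of all pairwise distances stays the same when a molecule is turned around in space. That is why such a table pins down the shape itself (up to orientation and mirror image) rather than any particular position.

```python
import math


def distance_matrix(coords):
    """Return the table of Euclidean distances between every pair of points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rotate_z(coords, angle):
    """Rotate every point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


# Hypothetical positions (in angstroms) for four amino acids; not real data.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.1, 4.2, 2.0)]

original = distance_matrix(toy_chain)
rotated = distance_matrix(rotate_z(toy_chain, 1.0))

# Turning the molecule moves every atom but leaves every pairwise distance alone.
same = all(
    math.isclose(original[i][j], rotated[i][j], abs_tol=1e-9)
    for i in range(len(toy_chain))
    for j in range(len(toy_chain))
)
print(same)  # prints True
```

A chain of N amino acids has N × N entries in this table, so a program that predicts the table well has in effect predicted the fold.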
It turns out that AlphaFold2 predicts protein structure with remarkable accuracy, often as good as the more complex laboratory methods that take months, and does so within a couple of hours, without any lab expenses! In fact, the average error of its shape predictions wound up being about 1.6 angstroms, about the width of a single atom! AlphaFold2 also predicted the shape of four protein domains whose structures researchers hadn’t yet finished solving. Before this year’s contest, it was thought that it would take at least ten years before AI could be improved to the point where it was about as good as experimental methods. It took less than two years.
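A figure like 1.6 angstroms is usually a root-mean-square deviation (RMSD): the typical distance between where the prediction puts each atom and where the experiment found it. Assuming that is the measure meant here, the arithmetic looks like the sketch below. The coordinates are invented, the function name `rmsd` is just for illustration, and the two structures are assumed to be already laid on top of each other; real comparisons first search for the best-fitting rotation and shift.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equally long lists of 3D points.

    Assumes the structures are already superimposed (no alignment is done here).
    """
    if len(predicted) != len(experimental):
        raise ValueError("structures must have the same number of atoms")
    squared = [math.dist(p, e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical coordinates in angstroms; not taken from any real protein.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0)]
predicted = [(1.0, 0.0, 0.0), (3.8, 1.0, 0.0), (5.0, 3.6, 1.0)]

print(rmsd(predicted, experimental))
```

In this toy case every atom is off by exactly one angstrom, so the RMSD comes out at one angstrom. The reported error of about 1.6 angstroms is of that order, measured across whole protein domains.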
Here’s a gif from the DeepMind post that shows how accurately AlphaFold2 predicted two protein structures. The congruence of the green (experimental) and blue (AI-predicted) shapes is remarkable.
There aren’t many cases where computers can make a whole experimental program obsolete, but this appears to be what’s happening here.
There is one bug in the method, though it’s a small one. As Matthew Cobb pointed out to me, in a few cases the sequence of amino acids doesn’t absolutely predict a protein’s shape. As he noted, “Sometimes the same AA [amino acid] sequence can have different isoforms [shapes that can shift back and forth], which can have Very Bad consequences—think of prions, in which the sequence is the same but the structure is different.” Prions are shape-shifting proteins that, in one of their shapes, can cause fatal neurodegenerative diseases like “Mad cow disease”. These are fortunately rare, but do show that the one-to-one relationship between protein sequence and protein shape does have exceptions.
Here’s a very nice video put out by DeepMind that explains the issue in eight minutes:
We’ll have to wait until the paper comes out to see the details, but the fact that the computer program predicted the shapes of proteins so very well means that its developers are doing something right, and we’re all the beneficiaries.