Has the problem of protein folding been solved?

December 1, 2020 • 1:00 pm

One of the biggest and hardest problems in biology, which has huge potential payoffs for human welfare, is how to figure out what shape a protein has from the sequence of its constituent amino acids. As you probably know, a lot of DNA codes for proteins (about 20,000 proteins in our own genome), each protein being a string of amino acids, sometimes connected to other molecules like sugars or hemes. The amino acid sequence is determined by the DNA sequence, in which each set of three nucleotide bases in the “structural” part of the DNA codes for a single amino acid. The DNA is transcribed into messenger RNA, which goes into the cytoplasm where, attached to structures called ribosomes, and with the help of enzymes, the mRNA sequence is translated into proteins, which can be hundreds of amino acids long.
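
To make the coding step concrete, here's a minimal sketch of triplet translation. The codon table below contains only a handful of the 64 entries, just enough for the example; it is an illustration, not a bioinformatics tool.

```python
# Illustrative sketch: translating a short open reading frame into protein,
# using a few entries of the standard codon table.
CODON_TABLE = {
    "ATG": "M", "TTT": "F", "TTC": "F", "AAA": "K", "GGC": "G",
    "GAA": "E", "CAT": "H", "TGG": "W", "TAA": "*", "TAG": "*", "TGA": "*",
}

def translate(dna: str) -> str:
    """Read the DNA three bases (one codon) at a time; stop at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":          # stop codon: end of the protein chain
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTAAAGGCTAA"))  # -> MFKG
```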

In nearly every case (see below for one exception), the sequence of amino acids itself determines the shape of the resultant protein, for the laws of physics determine how a protein will fold up as its constituent bits attract or repel each other. The shape can involve helices, flat sheets, and all manner of odd twists and turns. Here’s one protein, PDB 6C7C: Enoyl-CoA hydratase, an enzyme from a bacterium that causes human skin ulcers. This isn’t a very complex shape, but the protein may be important in studying how a related bacterium causes tuberculosis, as well as in designing drugs against those skin ulcers:

And here’s human hemoglobin, formed by the agglomeration of four protein chains, two copies each from two genes (from Wikipedia):

Knowing protein shape is useful for many reasons, including ones related to health. Drugs, for example, can be designed to bind to and knock out target proteins, but it’s much easier to design a drug if you know the protein’s shape. (We know the shape of only about a quarter of our 20,000 proteins.) Knowing a protein’s shape can also reveal how a pathogen causes disease, such as how the “spike protein” of the COVID-19 virus latches onto human cells (this helped in the development of vaccines). Here’s the viral spike protein, with one receptor binding domain depicted as ribbons:

And there are many questions, both physiological and evolutionary, that hinge on knowing protein shapes. When one protein evolves into a different one, how much does that affect shape change, and can that change explain a change of function? (Remember, under Darwinian evolution, gradual changes of sequence must be continually adaptive.) How do different shapes of odorants interact with the olfactory receptor proteins, giving a largely one-to-one relationship between protein shape and odor molecules?

Until now, determining protein shape was one of the most tedious and onerous tasks in biology. It started decades ago with X-ray crystallography, in which a protein had to be crystallized and then bombarded with X-rays, with the resulting diffraction patterns having to be laboriously interpreted and back-calculated into estimates of shape. (This is how the shape of DNA was determined by Franklin and Wilkins.) This often took years for a single protein. There are other ways, too, including nuclear magnetic resonance, and newer methods like cryogenic electron microscopy, but these too are painstakingly slow.

Now, as the result of a competition in which scientific teams are asked to use computer programs to predict the structures of proteins that have already been determined but not published, one team, DeepMind from Google, has achieved astounding predictive success using artificial intelligence (AI), to the point where other technologies for determining protein structure may eventually become obsolete.

There are two articles below, but dozens on the Internet. The first one below, from Nature, is comprehensive (click on screenshot to read both):


This article, from the DeepMind blog itself (click on screenshot), is shorter but has a lot of useful information, as well as a visual that shows how closely their AI program predicted protein structure.

 

In a biennial contest called CASP (Critical Assessment of Structure Prediction), about a hundred competing teams were asked to predict the three-dimensional structure of about a hundred sections of proteins (“domains”). The 3D structures of these domains were already known to those who worked on them, but were unknown to the competitors, as the structures hadn’t been published.

How DeepMind’s AI program did this is above my pay grade, but it involved training the “AlphaFold” program on the amino-acid sequences of proteins whose 3-D structures were already known. They began a couple of years ago in the contest by training the program to predict the distance between every pair of amino acids in a protein (if you know the distances between all pairs of amino acids, you have the 3D structure). This year they used a more sophisticated program, called AlphaFold2, that, according to the Nature article, “incorporate[s] additional information about the physical and geometric constraints that determine how a protein folds.” (I have no idea what these constraints are; the procedure hasn’t yet been published but will be early next year.)
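
To see why pairwise distances pin down a 3D shape (up to rotation and reflection), here's a small sketch, emphatically not DeepMind's method: classical multidimensional scaling recovers coordinates from a complete distance matrix.

```python
import numpy as np

# Illustration of the principle that a full pairwise-distance matrix
# determines 3D coordinates (up to rigid motion and reflection),
# via classical multidimensional scaling.
rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3))            # a fake 10-residue "protein"
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

# Classical MDS: double-center the squared distances, then eigendecompose.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(B)                     # eigenvalues in ascending order
X = V[:, -3:] * np.sqrt(w[-3:])              # top 3 eigenpairs -> 3D coords

# The recovered coordinates reproduce every pairwise distance.
D_rec = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.allclose(D, D_rec))                 # True
```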

It turns out that AlphaFold2 predicts protein structure with remarkable accuracy—often as good as the more complex laboratory methods that take months—and does so within a couple of hours, and without any lab expenses! In fact, the accuracy of shape prediction wound up being about 1.6 angstroms—about the width of a single atom! AlphaFold2 also predicted the shape of four protein domains that hadn’t yet been finished by researchers.  Before this year’s contest, it was thought that it would take at least ten years before AI could be improved to the point where it was about as good as experimental methods. It took less than two years.
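
For the curious, a hedged sketch of how an angstrom-level accuracy figure like that can be computed: the standard Cα RMSD after optimal rigid superposition (the Kabsch algorithm). This is one common metric, not necessarily the exact one behind the 1.6 Å number.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q (n x 3) after optimal rigid alignment."""
    P = P - P.mean(axis=0)                    # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)         # optimal rotation via SVD
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return float(np.sqrt(((P @ R - Q) ** 2).sum() / len(P)))

rng = np.random.default_rng(1)
exp = rng.normal(size=(50, 3)) * 10           # mock experimental C-alpha coords
pred = exp + rng.normal(scale=0.5, size=exp.shape)  # mock prediction
print(round(kabsch_rmsd(exp, pred), 2))       # small: close to the noise level
```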

Here’s a gif from the DeepMind post that shows how accurately AlphaFold2 predicted two protein structures. The congruence of the green (experimental) and blue (AI-predicted) shapes is remarkable.

There aren’t many cases where computers can make a whole experimental program obsolete, but this appears to be what’s happening here.

There is one bug in the method, though it’s a small one. As Matthew Cobb pointed out to me, in a few cases the sequence of amino acids doesn’t absolutely predict a protein’s shape. As he noted, “Sometimes the same AA [amino acid] sequence can have different isoforms [shapes that can shift back and forth], which can have Very Bad consequences—think of prions, in which the sequence is the same but the structure is different.” Prions are shape-shifting proteins that, in one of their shapes, can cause fatal neurodegenerative diseases like “Mad cow disease”. These are fortunately rare, but do show that the one-to-one relationship between protein sequence and protein shape does have exceptions.

Here’s a very nice video put out by DeepMind that explains the issue in eight minutes:

We’ll have to wait until the paper comes out to see the details, but the fact that the computer program predicted the shapes of proteins so very well means that they’re doing something right, and we’re all the beneficiaries.

53 thoughts on “Has the problem of protein folding been solved?”

  1. I’ve been baffled as to why this didn’t make the front (web?)page of every newspaper in the developed world. Anyway, it’s certainly huge, and a reminder of how fast things can move. It will be interesting to see what the scientists who have devoted careers to this very problem move on to do.

    1. I’ve been seeing it reported everywhere but perhaps not in mainstream newspapers. If not, it’s probably because it doesn’t have much immediate bearing on regular folk’s lives yet.

      1. That could change if the method can cast light on why proteins misfold. The accumulation of misfolded proteins is implicated in neurodegenerative diseases like Alzheimer’s.

  2. Wow, that’s a fantastic development. I have loved being an evolutionary biologist but my other dream career was in protein biochemistry. This is why.

    I hate to contradict Dr. Cobb, but this is not quite right:

    “As he noted, “Sometimes the same AA [amino acid] sequence can have different isoforms [shapes that can shift back and forth], which can have Very Bad consequences—think of prions, in which the sequence is the same but the structure is different.” ”

    Most people who study gene expression use “isoform” to mean alternative transcripts of the same protein-coding gene, where different isoforms are created by alternative splicing of exons. So same nucleotide coding sequence for two isoforms, but different amino acid sequence for two isoforms (not the same AA sequence).

    Shape-shifting by prions is something different (or at least something more specific than the general differences between isoforms of other proteins).

      1. Yes for sure. But those post-translational modifications are not called isoforms. Identical amino acid sequences with different sugar molecules attached to the protein are called glycoforms.

        1. Yes, of course. Metals, nucleic acids, carboxylic acids, even lipids can change a protein’s shape but none of those would be isoforms either. I mentioned PT mods to d*g pile on the complexity of the problem we face(d); for many proteins the primary AA sequence does not entirely describe the final configuration of the functional form(s).

    1. Mike said: Wow, that’s a fantastic development. I have loved being an evolutionary biologist but my other dream career was in protein biochemistry. This is why.

      You can be both. Amongst other things I study how protein structures evolve.

  3. The therapeutic possibilities alone make this an uber-Nobel Prize result, if it proves robust. For prion diseases such as dreaded nvCreutzfeldt–Jakob, various grave neuromotor conditions and the like, we may be looking at therapies, maybe cures or even vaccines within a generation. Absolutely fantastic, and yet another demonstration of why people should stick with science…

  4. Quite amazing.
    It’s also worth pointing out that many proteins (including enzymes and transporters) function by changing shape slightly.

  5. The method for how Deep Mind’s AI program did this is above my pay grade, …

    If I understand it rightly, the basic principle here is effectively just natural selection:

    1) Tweak neural network (= adjust node connection strengths).
    2) Use neural network to predict the protein shape.
    3) Compare to training dataset and pick variant that works best.
    4) Go to 1.

    Such AI-bots learned to be way better than any human at chess and Go, simply by playing against themselves millions of times, iterating all the time and picking the winner each iteration.
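
The numbered loop above can be written as a runnable toy: random hill-climbing on a tiny linear model, keeping whichever variant scores best. (Real deep-learning systems like AlphaFold use gradient descent rather than random tweaks, but the select-the-best intuition is the same.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # toy training inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w                                # toy targets

def loss(w):
    """Mean squared error against the training data."""
    return float(((X @ w - y) ** 2).mean())

w = np.zeros(4)                               # start with some weights
for _ in range(5000):
    candidate = w + rng.normal(scale=0.1, size=4)   # 1) tweak
    if loss(candidate) < loss(w):                   # 2-3) evaluate, pick best
        w = candidate                               # 4) go to 1
print(loss(w) < 0.1)                          # True: the loop converges
```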

    1. I have had big arguments in the past with people about these evolutionary methods and whether they should be called artificial intelligence. In my view, the resulting systems display no intelligence whatsoever.

      1. Are you denying the benefits of evolution to all non-humans? But seriously, the term “artificial intelligence” now has many meanings as often happens when a field explodes into sub-fields. Certainly protein folding uses AI technology but isn’t Artificial General Intelligence that thinks like we do.

  6. I would be interested to know how this new program competes with FoldIt (https://en.wikipedia.org/wiki/Foldit) and other crowd-sourced attempts at solving protein folding problems. I remember reading that some people were very good at this task, though not as good as AlphaFold presumably.

    One of the biggest problems with neural networks is explainability, or the lack of it. Neural networks can be trained to do a certain job but, once they’ve done it, no one learns much about how they actually do it. It’s a problem shared by humans. It has been observed that it is pointless asking a gifted artist or chess player how they do what they do. They may give an answer but it is unlikely to be accurate and surely not close to complete. Perhaps lack of explainability won’t be a problem for protein folding as the final answer is all we need.

    1. The “explainability” is an interesting problem.

      A deep neural network is much akin to a “clever Hans” routine and doesn’t really know when and why it fails. A classic example is a DNN that was trained on images of boats, only for the users to discover that it was looking for a boat-like interruption of the water surface – e.g. a boat dock would be classified as “a boat”.

      I note that an attention DNN [see my longer comment] is circumventing some of the problems by looking into what it is doing, albeit still not in an understandable way.

      FWIW, I read the basic problem may have been solved some months ago. The possible solution is – you may have guessed it – having another DNN “picking its brain” and studying what the “clever Hans” actually did. E.g. was it concentrating on boats (certain pixels in an image) or on water surfaces (other pixels). “Who watches the watchmen? Other watchmen!”

      1. Yes, neural networks can’t really generalize like humans can. When we see the boats, we see a lot more structure, not just a correlation of pixels.

        If we think of neural networks as function optimization, it seems reasonable that NNs can be part of some future AI that really understands but what pieces are missing are not obvious. I believe the brain has a huge amount of innate knowledge that is going to be hard to reproduce in an AI. Attempts have been made but they are based on logic which is not how the brain works. We need to invent some other kind of representation and computational model and then fill it with all the innate knowledge accumulated by millions of years of evolution.

        1. Neural networks can generalise in the way humans can. We know this because the structure in a human that does the generalisation is a literal neural network.

          It’s simply that the ones we have created aren’t yet anywhere near as good as the one inside your skull.

          1. In theory, a neural network could do anything that a Turing Machine can do. Same for the human brain. That doesn’t mean current neural networks have much to do with our brain. Neural networks do generalize from what they are trained on but in often unpredictable ways.

    2. As far as I can see from the CASP website, Foldit didn’t enter this year. In previous years they have performed in the cluster of best-performing groups (along with Bakerlab, Zhang, etc). AlphaFold’s predictions are much, much more accurate than all the other groups’.

  7. Absolutely fascinating. I wonder how significant the differences in structure are – how well do they need to match? Perfect?

  8. Thanks, I saw this and was immediately interested in ramifications so I’m happy this was written up!

    There will of course be caveats. Besides, on the protein folding side there are also cases where predictions are well below the median.

    If anyone else was wondering what “an attention-based neural network system” is, I found this primer from the language learning examples [ https://buomsoo-kim.github.io/attention/2020/01/01/Attention-mechanism-1.md/ ]. Their reference paper shows that they supplement the data with a variable-length vector that contains information on the network itself (its hidden nodes) [ https://arxiv.org/pdf/1508.04025.pdf ] – it “pays attention” to how it is performing and learns faster and better.
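
For the curious, the core attention operation itself fits in a few lines. This is the generic scaled dot-product form from the literature, not AlphaFold's exact architecture; no training is involved, just the shapes.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V                             # weighted average of values

rng = np.random.default_rng(0)
n, d = 6, 8                                        # 6 positions, 8-dim features
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)                                   # (6, 8)
```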

    As a technical side note of possible improvement, I noticed that the primer describes “self-attention-based models such as Bidirectional Encoder Representations from Transformers (BERT)”. Right before I read this my feed popped a press release on a paper that promises to massively improve those [ https://www.sciencedaily.com/releases/2020/12/201201144041.htm ].

    “”A standard BERT model these days — the garden variety — has 340 million parameters,” says Frankle, adding that the number can reach 1 billion.”

    “They experimented by iteratively pruning parameters from the full BERT network, then comparing the new subnetwork’s performance to that of the original BERT model. They ran this comparison for a range of NLP tasks, from answering questions to filling the blank word in a sentence. The researchers found successful subnetworks that were 40 to 90 percent slimmer than the initial BERT model, depending on the task.”

    1. Thanks for this TorBjorn. I understand the structural biology well enough, but have been struggling with the AI aspects.

  10. To the naked human eye, the illustrations of predicted vs. actual results don’t look very much alike. There are lots of obvious differences; e.g., surely it matters whether a sequence is flat or helical? Everything else in the article sounds great, but I don’t get this aspect . . .

    1. Look at the rotating image. On the left is the prediction of a protein and its experimental structure. The important thing is how closely blue matches green. On the right is a different structure. Again, blue matches green very closely.

  11. Structural biologist and occasional CASP participant here.

    This is indeed an incredible achievement. They have massively outperformed everyone in CASP (this is the biennial competition* in structure prediction as Jerry describes), and you shouldn’t believe anything in this field unless it’s been benchmarked in CASP. So, you can trust the results as far as CASP reports them.

    A few notes of caution. The new methodology has not been published and the code is not available, so it’s hard to know what they’ve done. CASP chops proteins up into ‘prediction units’, which correspond to protein domains. These are smaller, compact folded units, and a protein can have many domains. Predicting the arrangement of these domains relative to each other is as yet an unsolved problem. Similarly, multi-protein complexes aren’t part of CASP.

    Upshot – the structure prediction problem seems to have been solved, as far as we can tell, for the set of things in CASP (single domains) which is a remarkable achievement and someone will be off to Stockholm soon to get their Nobel.

    Now if you’ll excuse me I’m off to rewrite my undergraduate lectures…

    *The founder, John Moult, says it’s a benchmarking experiment and gets very cross if you call it a competition.

  12. Well, actually…

    Yes this is a cool result. But this statement, summarizing Cobb’s caveat, is wrong: “in a few cases the sequence of amino acids doesn’t absolutely predict a protein’s shape”, but not because of prions. In fact many proteins do not even have a fixed structure. I have seen estimates of up to 30% of protein sequences having intrinsic disorder. It isn’t that these regions have multiple shapes, but that they have no fixed shape. These disordered regions have been associated with viral virulence and cancer.

    The “lock and key” paradigm, that protein structure determines function, has been very useful, so this is a significant result. But it leaves a significant part of the story untold. Historically, protein structure has been determined by crystallography, but proteins that don’t form rigid shapes don’t crystallize. So they have been mostly ignored. That doesn’t mean they aren’t there or aren’t important.

    We can predict disordered regions in proteins computationally (our tool even uses neural nets), so we aren’t biased toward rigid structure by our method.
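
As a toy illustration of the kind of signal such predictors pick up (emphatically not the commenter's actual tool): disordered regions tend to be depleted in hydrophobic residues, so even a sliding-window hydropathy average (Kyte-Doolittle scale) gives a crude disorder-propensity flag. The window size and cutoff here are arbitrary illustrative choices.

```python
# Kyte-Doolittle hydropathy values per amino acid (higher = more hydrophobic).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def disorder_prone(seq, window=9, cutoff=-1.0):
    """Flag each window whose mean hydropathy falls below `cutoff`."""
    flags = []
    for i in range(len(seq) - window + 1):
        mean_kd = sum(KD[aa] for aa in seq[i:i + window]) / window
        flags.append(mean_kd < cutoff)
    return flags

# A hydrophobic stretch followed by a charged, disorder-like stretch:
seq = "ILVFLIVLA" + "EDKRSEDKR"
flags = disorder_prone(seq)
print(flags[0], flags[-1])   # False True
```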

    So, cool result here. But let’s not ignore that it is far from a solution to the “protein structure” problem. We won’t be there until we can recognize what “protein non-structure” does.

    1. Yes, and some of those disordered segments have insidious roles, doing things like becoming part of a beta-sheet in another protein and thereby becoming tightly-bound to it, such as with the anthrax toxin.

      Also some ordinary multi-subunit enzymes (like the aldehyde dehydrogenase that enables us to consume alcoholic beverages) are held together in part by beta-sheets composed of segments from different subunit monomers.

  13. Regarding how Alphafold works, it’s worth noting that they didn’t just show a NN a bunch of sequences and known structures and let it go and solve protein folding. The method is heavily based on a deep learning implementation of direct coupling analysis, a coevolution based method for inferring contacts between amino acid positions from multiple sequence alignments. This method was developed by several protein folding groups over the last ~10 years.
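
A minimal sketch of the coevolution signal behind such methods, using simple mutual information between alignment columns. (Real direct coupling analysis goes further and disentangles direct from indirect couplings; this toy does not.)

```python
import math
from collections import Counter

# Columns of a multiple sequence alignment that mutate together hint at
# a 3D contact between those positions.
msa = [  # toy alignment: columns 0 and 3 covary perfectly, others don't
    "AKDE", "AKCE", "SKDF", "SKCF", "AKDE", "SKCF",
]

def column(j):
    return [seq[j] for seq in msa]

def mutual_information(i, j):
    """MI between alignment columns i and j (0 = independent columns)."""
    n = len(msa)
    pi, pj = Counter(column(i)), Counter(column(j))
    pij = Counter(zip(column(i), column(j)))
    return sum((c / n) * math.log((c / n) / (pi[a] / n * pj[b] / n))
               for (a, b), c in pij.items())

print(mutual_information(0, 3) > mutual_information(1, 2))  # True
```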

    Not that this makes the work of the Deepmind folks any less impressive. It’s not a matter of AI replacing the work of human scientists, it’s a matter of AI massively amplifying the work of human scientists.

    1. This follows a pattern established by most, perhaps all, successful NN efforts. Although the NN is given much of the credit, there is a lot of human expertise expressed in the program’s architecture. One way to look at such programs is that they are like expert systems, programs that capture humans’ expertise in a field, but with NN-based function optimizers thrown into the mix. It’s hard to pin down where the intelligence resides: the hand-coded expertise embodied in the architecture that wraps the neural networks or the neural networks themselves. Regardless, we are still nowhere near having an AI do science like a human scientist.

  14. If I understand correctly:

    The computer program is trained on as many experimentally determined protein structures of some state of order/disorder as are available, each in a plethora of in vitro experimental conditions. To say the least.

    The computer program reads an amino acid sequence and outputs a three dimensional structure as best matches the training data set.

    If that is true, then what does that have to do with predicting the outcome of protein folding? The program is – if I understand correctly – predicting an outcome of an in vitro experimental determination of the input amino acid sequence – if that sequence had in fact happened to be isolated in conditions and a state which would allow any structure to be determined in the first place. The results could simply be the right answer for the wrong or unknown reasons. And this is not the first computer program to generate such results.

  15. 1. This work is not peer reviewed. The work is described on pages 22-24 in the book of abstracts here :

    https://predictioncenter.org/casp14/doc/CASP14_Abstracts.pdf

    2. The score is apparently based on the “C-alpha lDDT” (a local distance difference test on C-alpha atoms) in this publication from 2013:

    Mariani, V., Biasini, M., Barbato, A., & Schwede, T. (2013). lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21), 2722-2728.

    The older way to do it is RMSD on C-alpha carbons. It is worth noting that this is just one atom out of all the atoms in any given amino acid.
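
A simplified sketch of the lDDT idea from Mariani et al. (2013): instead of superposing the two structures, compare local inter-atomic distances. For every reference pair within an inclusion radius, check whether the model preserves that distance to within each threshold, then average the fractions. (The published score has per-residue details this toy omits.)

```python
import numpy as np

def lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Superposition-free local distance difference score in [0, 1]."""
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_mod = np.linalg.norm(model[:, None] - model[None, :], axis=-1)
    n = len(ref)
    mask = (d_ref < radius) & ~np.eye(n, dtype=bool)   # local pairs only
    diff = np.abs(d_ref - d_mod)[mask]
    return float(np.mean([(diff < t).mean() for t in thresholds]))

rng = np.random.default_rng(0)
ref = rng.normal(size=(30, 3)) * 5                   # mock C-alpha coordinates
noisy = ref + rng.normal(scale=3.0, size=ref.shape)  # badly perturbed model
print(lddt(ref, ref))                                # 1.0 for a perfect model
print(lddt(ref, noisy) < 1.0)                        # True
```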

    I point out the top score on this histogram :

    https://media.nature.com/lw800/magazine-assets/d41586-020-03348-4/d41586-020-03348-4_18633154.jpg

    … increases almost linearly from 2014 on. It is not stated whether the scoring methods are identical across the years. The Nature News blurbs fail to explain what precisely accounts for the asserted exceptional performance of the program, though they mention that it produces “poor” predictions based on NMR structures and uses “additional information” – constraints – about how a protein folds. It is unclear what this “additional information” is, when it was discovered, why it wasn’t used before, or whether other groups used it before.

  16. Fascinating. I had seen little snippets here and there about “protein folding,” but your crisp, easy-to-understand science writing really helped me to see what the big hoo-hah is all about. Thank you!

  17. Below is a broad overview of protein folding — and let’s emphasize, “protein folding” — not “protein folded in vitro in a plethora of expression systems except when the structure is determined by NMR” :

    “The path taken by a polypeptide through the secretory pathway starts with its translocation across or into the ER membrane. It then must fold and be modified correctly in the ER before being transported via the Golgi apparatus to the cell surface or another destination. Being physically segregated from the cytosol means that the ER lumen has a distinct folding environment. It contains much of the machinery for fulfilling the task of protein production, including complex pathways for folding, assembly, modification, quality control, and recycling. Importantly, the compartmentalization means that several modifications that do not occur in the cytosol, such as glycosylation and extensive disulfide bond formation, can occur to secreted proteins to enhance their stability before their exposure to the extracellular milieu.”

    Source:
    Authors:
    Ineke Braakman [1] , Neil J Bulleid
    Affiliations:
    [1] Cellular Protein Chemistry, Faculty of Science, Utrecht University, Utrecht, The Netherlands. i.braakman@uu.nl
    Annu Rev Biochem
    2011;80:71-99.
    doi: 10.1146/annurev-biochem-062209-093836.

    I would mention that chaperones are known to assist protein folding.

    The primary sequence certainly dictates and is necessary for the numerous steps in that process — but it does not mean it is sufficient to fold a protein. And the news release of the conference abstracts mentions nothing about the in vivo process of protein folding.

  18. A note on interpreting the figures and a distinction of “fold” from “shape”:

    The figures of proteins here show ribbon diagrams, which trace just one atom from each amino acid – the backbone C-alpha atom. The score given to the programs – “C-alpha lDDT” – might use only this atom, or maybe the peptide bond as well, I do not know. This, along with Ramachandran angles, is useful to assess the fold of a protein, the helices and sheets, etc. But the graphics program is using only the C-alpha atom.

    The “shape” of a protein suggests that every atomic position is known – as in the famous “lock and key” hypothesis, which explains how the amino acid side chains interact chemically, or how oligomers assemble, etc. This is not what the ribbon diagrams are showing. There are no side chains or peptide bonds displayed on the AlphaFold result. The figure shows that the program has identified the fold and has accurately placed the C-alpha atoms in the two pairs of structures. That is all the AlphaFold figure shows. The program might have done a good job with the side chains too; they just are not in the figure. They may have used something besides a C-alpha trace, but C-alpha traces are very useful and popular.
