{"title":"Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design","authors":"Fay Lin","doi":"10.1089/genbio.2023.29114.fli","DOIUrl":null,"url":null,"abstract":"GEN Biotechnology, Vol. 2, No. 5. News Features. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. Fay Lin, Senior Editor, GEN Biotechnology. Published online: 16 Oct 2023. https://doi.org/10.1089/genbio.2023.29114.fli Diffusion models, a form of generative artificial intelligence, are a rising tool for protein design, showing improved experimental success and new potential for biotechnological applications. (Image caption: This protein fold is one of thousands designed from scratch using new machine learning methods. Credit: Ian C. Haydon/UW Institute for Protein Design.) In July 2023, scientists in David Baker's laboratory at the University of Washington (UW) published a report in Nature detailing a new deep-learning framework for de novo protein design called RoseTTAFold diffusion (RFdiffusion).¹ Since then, the scientific community has been buzzing about RFdiffusion's unprecedented experimental success rate and ease of use. David Juergens, a graduate student in Baker's laboratory and one of seven co-lead authors of the Nature article, shared an anecdote about a scientist working in a lab in China, who posted on social media how “they designed a protein in a browser, ordered the sequence, purified the protein, crystallized it, and then got a crystal structure that was half an angstrom away from the design that was on the computer. 
It was amazing!” Juergens told me. (Photo: David Baker, Professor in Biochemistry and Director of the Institute for Protein Design at UW.) Some of the applications of RFdiffusion, documented with experimental validation in the Nature article, include design of symmetric oligomers for vaccine platforms and delivery vehicles and generation of high-affinity binders for therapeutics.¹ In another project, the Baker laboratory has applied RFdiffusion to design proteins that bind peptide hormones, which are established biomarkers for clinical care and biomedical research, for diagnostic applications.² Box 1. Let's Generate Interactions. Generate: Biomedicines is a Boston-based therapeutics company at the intersection of machine learning, biological engineering, and medicine. Molly Gibson, cofounder and chief strategy and innovation officer, says the company focuses on designing protein–protein interactions for therapeutic applications. “If you think about biologics, the most important function that a protein takes is creating very specific and potent binding with its target. This could be things like an antibody where we know exactly where we want to neutralize a target, or where we want to agonize and potentiate function,” said Gibson. One project at Generate: Biomedicines has worked to create a broadly neutralizing antibody for coronavirus. 
Gibson notes that the virus actively mutates on the epitope targeted by biologics, leading to many COVID therapeutics losing emergency use authorization (EUA). “We know that there are some parts of the virus that just don't mutate, but interestingly, our immune systems and the immune system of animals where we traditionally get antibodies don't create antibodies commonly against the non-mutating part of the virus,” Gibson continued. She adds that targeting these nonmutating areas makes the therapeutic less likely to be made ineffective by future virus mutations. In September, Generate: Biomedicines announced its first clinical trial for GB-0669, a monoclonal antibody targeting a highly conserved region of the spike protein in SARS-CoV-2. The company also expects to file a Clinical Trial Application by early Q4 2023 for its anti-TSLP monoclonal antibody in asthma, which is expected to enter clinical trials shortly thereafter. Generate: Biomedicines has a multimodality therapeutic focus with projects in infectious disease, oncology, and immunology. “We've really focused on building a diverse set of expertise, not just in protein design, but also in clinical development and manufacturing,” Gibson says. By integrating various areas of expertise, “we're able to use this technology in ways that impact people,” she added. One key tool is a new cryogenic electron microscopy (CryoEM) facility for generating large-scale structural data to complement the company's in-house protein design machine learning tools and facilitate the drug discovery process. Unveiled in June, this 70,000-square-foot site in Andover, Massachusetts, is among the largest privately owned CryoEM laboratories in the United States. (End of Box 1.) The Baker laboratory is not the only group developing so-called diffusion models, a class of models that leverage generative artificial intelligence (AI), for protein design. The laboratory first posted a preprint on RFdiffusion on bioRxiv last December. 
At the same time, the AI-focused therapeutics company Generate: Biomedicines posted its diffusion model, called Chroma, as a preprint.³ Chroma is one of many platforms offered by the company (Box 1). A month later, the laboratory of Mohammed AlQuraishi, assistant professor of systems biology at Columbia University, posted a preprint on their own diffusion model, Genie.⁴ “All of these various groups have been thinking about these models at around the same time,” AlQuraishi told GEN Biotechnology. “RFdiffusion works pretty well. Certainly, it's the most well validated among the published methods [at this time].” Although Chroma is not publicly available, the code for Genie is available for public use. AlQuraishi also states that experimental validation for Genie is underway. RFdiffusion is publicly available in a user-friendly online Google Colaboratory notebook.⁵ Although many experienced scientists are applying RFdiffusion to their protein design efforts and validating their designs in the laboratory, “anyone with a browser” can design a protein on the computer that nature has never seen before, and share it on social media. No coding knowledge is necessary. (Photo: Mohammed AlQuraishi, Assistant Professor of Systems Biology at Columbia University.) Proteins Beyond Nature. Before the AI revolution, protein design approaches were limited to generating designs based on nature's existing proteins. These standard methods had limitations, as nature has only sampled a small subset of the possible protein landscape, and evolution does not necessarily select for attributes that are desirable from a pharmaceutical or biotechnological standpoint. 
Solubility, stability, ease of production, and low immunogenicity are some of the many characteristics that are crucial from an application and scalability perspective. In contrast, generative AI approaches emphasize de novo protein design (designing new proteins from scratch) with the goal of expanding the repertoire of functions and desirable attributes beyond what nature has achieved. Since the landmark release of AlphaFold,⁶ the acclaimed AI program from Google's DeepMind that made a grand leap in solving one of biology's biggest problems (determining a protein's 3D structure from its sequence), AI-powered protein design has been a rising force promising new possibilities for biotechnological applications. Historically, protein structure prediction and design were time-consuming processes due to low experimental validation rates for computationally derived structures. AI tools such as AlphaFold have allowed prediction of protein structures with unprecedented speed and accuracy, streamlining the research process for drug discovery, industrial applications, and more. In September, the developers of AlphaFold, Demis Hassabis and John Jumper, were among the winners of the 2023 Lasker Awards. 
This prestigious prize recognizes individuals who have made major contributions to medical science. During last month's The State of Biotech, GEN's annual flagship virtual event, renowned UW structural biologist David Baker noted that the first de novo designed medicine to reach approval, SKYCovione, a COVID vaccine developed by SK Bioscience and the UW Institute for Protein Design (IPD), was approved in June in South Korea for use in adults. “It's an exciting time for protein design!” emphasized Baker, who said the protein design field has seen a major shift from a predominantly biophysical approach, based on the idea that proteins fold to their lowest energy structures, to applying deep learning. Baker, who is a professor in biochemistry, the director of the UW IPD, and a Howard Hughes Medical Institute Investigator, is also a prolific biotech entrepreneur, having cofounded nine companies and serving as a scientific advisor to 18 others, according to the IPD's website.⁷ In Need of a Backbone. As the name suggests, RFdiffusion leverages diffusion models, a generative AI approach that has seen considerable success in image generation tools, such as Midjourney, DALL-E 2 from OpenAI, and Stable Diffusion from Stability AI. These generative models learn the patterns of their training data and generate new outputs with similar characteristics. “We had these neural network architectures that had been useful for protein structure prediction, in particular RoseTTAFold and AlphaFold.⁶,⁸ The idea was that maybe these [tools] could be useful for design as well, but it wasn't immediately obvious how to do that in a way that could generate large diversity,” said Brian Trippe, a postdoctoral fellow based at Columbia University and co-advised by Baker. Trippe is another lead author of RFdiffusion. “In a design campaign, you need a bunch of different good options to order and test in the laboratory. 
The community was generally thinking, ‘how do we get real generative AI involved in protein design?’ People were moving toward diffusion models because of how strong they looked in image generation literature,” continued Juergens. The Baker laboratory's structure-forward approach for de novo protein design follows four steps (Fig. 1). First, a backbone, or a protein structure suggested to produce a specific function, is generated. Second, sequences are designed that are predicted to fold into the desired backbone. Third, the sequences from Step 2 are computationally filtered for the top candidates that are most likely to succeed in Step 4, experimental validation. FIG. 1. A structure-forward approach for de novo protein design (Credit: Joseph Watson). (1) Generate a structural backbone with a suggested function. (2) Design sequences that are predicted to fold into the backbone from (1). (3) Computationally filter the sequences from (2) for the top candidates predicted to succeed in experimental validation. (4) Experimentally characterize the predicted design. With the rise of machine learning in protein design, many tools have been developed to facilitate Steps 2 and 3 of this workflow, such as ProteinMPNN⁹ for fixed-backbone sequence design and AlphaFold and RoseTTAFold for computationally validating sequence folds, a method known as a self-consistency measure. Methods for diverse and high-throughput backbone generation (Step 1) have remained a bottleneck. 
Diffusion models are a natural fit to address this problem given that they can generate large numbers of diverse outputs, operate directly on amino acid coordinates, and condition on a wide range of inputs with the goal of specifying function. Experimentally Validated. “We put an enormous amount of effort into working out the mathematical and statistical programming implementation details of this idea and our simulation results were looking just exciting enough to put out a preprint,” said Trippe. It was one week in December 2022, just before Baker posted the RFdiffusion preprint, when the experimental validation started coming in. “We were hearing [remarks from colleagues] that the binders we wanted to try were actually sticking. It seemed that everything that we had tried was working!” said Trippe. Helen Eisenach, a graduate student in the Baker laboratory and another co-lead author of RFdiffusion, emphasizes that one major advantage of RFdiffusion is the improved rate at which hypotheses are generated and tested in the laboratory. “Not only are you generating really good hypotheses in the computer, but then we're seeing that a lot of them are passing. Your rate of finding a successful design is just astronomically higher than a lot of other previous methods,” said Eisenach. “You have both speed on the compute side but also a lot higher accuracy on the experimental side.” She also notes that RFdiffusion is a step forward from previous backbone generation approaches from the Baker laboratory, such as “hallucination,” which produces good designs but has a slower design process, and “inpainting,” which provides quick generation but lacks diversity in output.¹⁰ Columbia's AlQuraishi concurs that one of the key advantages of diffusion models for protein design lies in the improved experimental success rates. “Prior to the ‘diffusion evolution’, the success rates were probably on the order of maybe 1 to 10,000, if you're lucky,” said AlQuraishi. 
“With diffusion models, the success rates are closer to the single percentages when you get into the laboratory. They're still not great but it's a huge magnitude improvement of what it used to be and it's been a really big deal.” AlQuraishi also indicates that these improved success rates are related to key conceptual differences that diffusion-based design brings to the table. Before diffusion, the lowest energy search approach to protein design, he says, was “an uninformative way to propose sequences. The vast majority of proposals are not correct and you're hoping that if you sample enough times, you'll happen upon something that works by chance.” “With diffusion models, it's more of a direct thing. You're not generating many hypotheses and then evaluating them in an uninformative way until, by luck, you happen onto something that works. The diffusion models take you to something that already works,” AlQuraishi continued. Learning a New Language. Structure-forward approaches, such as diffusion models, are not the only machine learning tools infiltrating the protein design field. Large language models that take advantage of protein sequencing data are another rising tool. Earlier this year, researchers from Salesforce Research published ProGen, a protein language model trained on millions of raw protein sequences that generates artificial proteins across multiple families and functions.¹¹ Juergens notes that an advantage of sequence-based design is the ability to model function that cannot be described by a static structure. “At the IPD, we have a very structure-forward thought process because many functions can be described by one or a few well-defined structures. There are some molecular functions that have a flexing of a molecule or a shifting of a domain. 
Language models that reason in sequence space have the capability of implicitly modeling those functions that are described by more than just a single structure,” said Juergens. “A good analogy is that for large language models, [researchers] used to think that you had to model how reasoning worked. You had to do neuro symbolic reasoning, which is this emerging behavior of how intelligence is formed. [Researchers later] found out that if you just do a very simple task, you can actually model very complex structures in human reasoning,” said Jason Yim, a graduate student at MIT and another co-lead author of RFdiffusion. “The idea here is that if you train on all the sequence data for proteins, you can learn something about the higher order of how the proteins operate,” Yim continued. Another factor driving the advance of language models in protein design, he says, is the wealth of sequence data in comparison to structural data. “In structure, it can take a graduate student an entire PhD to solve a single structure, whereas we can generate sequences at a much faster rate with the current sequencing technology. There's so much data that's underutilized for design.” (Photo: Helen Eisenach, David Juergens, Brian Trippe and Jason Yim, left to right, are co-lead authors of RFdiffusion who joined GEN Biotechnology for an interview to discuss the newest advances in protein design.) The Ultimate Convergence. Although the combination of sequence-based and structure-based machine learning approaches has opened new avenues for protein design, additional growth is on the horizon. AlQuraishi indicates that more work is needed for these models to accommodate complex instructions to generate structures for biotechnological applications. “An exciting frontier is conditional generation, where you don't just care about an end generation [that can be experimentally validated]. You care about specifying certain properties that are desirable for your protein,” said AlQuraishi. 
“Designing structures with fairly simple constraints [is currently achievable], but something more sophisticated, such as drug discovery, requires more complex constraints.” Building on this complexity, Juergens highlights that the field is also excited about higher order molecular functions that are not limited to protein atoms. “Trying to simulate molecular properties and functions that require, say a protein interacting with a metal, small molecule, or RNA or DNA is a growing area of focus for both structure prediction and design,” he said. Overall, the potential for AI tools to infiltrate all aspects of biotechnology continues to be boundless. “I could certainly see a point in time where [natural language models, such as ChatGPT,] are able to summarize papers and draw insights from the literature,” AlQuraishi remarks. “At some point, we may be able to interface molecular language models with natural language models, such that we can tell ChatGPT, ‘design your protein with those properties’, and then it goes and gives you a sequence.” That's some way off, “but that would be the ultimate convergence.” What new protein design barriers will AI break next? Only time will tell. References. 1. Watson JL, Juergens D, Bennett NR, et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620(7976):1089–1100; doi: 10.1038/s41586-023-06415-8 2. Torres SV, Leung PJY, Lutz ID, et al. De novo design of high-affinity protein binders to bioactive helical peptides. bioRxiv 2022;2022.12.10.519862; doi: 10.1101/2022.12.10.519862 3. Ingraham J, Baranov M, Costello Z, et al. Illuminating protein space with a programmable generative model. bioRxiv 2022;2022.12.01.518682; doi: 10.1101/2022.12.01.518682 4. Lin Y, AlQuraishi M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. 
arXiv 2023; doi: 10.48550/arXiv.2301.12485 5. RFdiffusion. Available from: https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb [Last accessed: September 18, 2023]. 6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–589; doi: 10.1038/s41586-021-03819-2 7. UW Institute for Protein Design. David Baker's Technology Transfer and Advisory Roles. 2023. Available from: https://www.ipd.uw.edu/baker-technology-transfer-roles/ [Last accessed: September 18, 2023]. 8. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373(6557):871–876; doi: 10.1126/science.abj8754 9. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378(6615):49–56; doi: 10.1126/science.add2187 10. Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science 2022;377(6604):387–394; doi: 10.1126/science.abn2100 11. Madani A, Krause B, Greene ER, et al. Large language models generate functional protein sequences across diverse families. 
Nat Biotechnol 2023;41(8):1099–1106; doi: 10.1038/s41587-022-01618-2 Copyright 2023, Mary Ann Liebert, Inc., publishers. To cite this article: Fay Lin. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. GEN Biotechnology. Oct 2023. 333-337. http://doi.org/10.1089/genbio.2023.29114.fli Published in Volume: 2 Issue 5: October 16, 2023.","PeriodicalId":73134,"journal":{"name":"GEN biotechnology","volume":"37 1","pages":"0"},"PeriodicalIF":2.0000,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GEN biotechnology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1089/genbio.2023.29114.fli","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
Citation count: 0
Abstract
GEN BiotechnologyVol. 2, No. 5 News FeaturesFree AccessDiffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein DesignFay LinFay LinE-mail Address: [email protected]Senior Editor, GEN BiotechnologySearch for more papers by this authorPublished Online:16 Oct 2023https://doi.org/10.1089/genbio.2023.29114.fliAboutSectionsPDF/EPUB Permissions & CitationsPermissionsDownload CitationsTrack CitationsAdd to favorites Back To Publication ShareShare onFacebookTwitterLinked InRedditEmail Diffusion models, a form of generative artificial intelligence, are a rising tool for protein design, showing improved experimental success and new potential for biotechnological applications.This protein fold is one of thousands designed from scratch using new machine learning methods. (Credit: Ian C. Haydon/UW Institute for Protein Design)In July 2023, scientists in David Baker's laboratory at the University of Washington (UW) published a report in Nature detailing a new deep-learning framework for de novo protein design called RoseTTAFold diffusion (RFdiffusion), in Nature.1 Since then, the scientific community has been buzzing about RFdiffusion's unprecedented experimental success rate and ease of use.David Juergens, a graduate student in Baker's laboratory and one of seven co-lead authors of the Nature article, shared an anecdote about a scientist working in a lab in China, who posted on social media how “they designed a protein in a browser, ordered the sequence, purified the protein, crystallized it, and then got a crystal structure that was half an angstrom away from the design that was on the computer. 
It was amazing!” Juergens told me.David Baker, Professor in Biochemistry and Director of the Institute for Protein Design at UWSome of the applications of RFdiffusion, documented with experimental validation in the Nature article, include design of symmetric oligomers for vaccine platforms and delivery vehicles and generation of high-affinity binders for therapeutics.1 In another project, the Baker laboratory has applied RFdiffusion to design proteins that bind peptide hormones—established biomarkers for clinical care and biomedical research—for diagnostic applications.2Box 1. Let's Generate interactionsGenerate: Biomedicines is a Boston-based therapeutics company at the intersection of machine learning, biological engineering, and medicine. Molly Gibson, cofounder and chief strategy and innovation officer, says the company focuses on designing protein–protein interactions for therapeutic applications.“If you think about biologics, the most important function that a protein takes is creating very specific and potent binding with its target. This could be things like an antibody where we know exactly where we want to neutralize a target, or where we want to agonize and potentiate function,” said Gibson.One project at Generate: Biomedicines has worked to create a broadly neutralizing antibody for coronavirus. 
Gibson notes that the virus actively mutates on the epitope targeted by biologics, leading to many COVID therapeutics losing emergency use authorization (EUA).“We know that there are some parts of the virus that just don't mutate, but interestingly, our immune systems and the immune system of animals where we traditionally get antibodies don't create antibodies commonly against the non-mutating part of the virus,” Gibson continued.She adds that targeting these nonmutating areas makes the therapeutic less likely to be made ineffective by future virus mutations.In September, Generate: Biomedicines announced their first clinical trial for GB-0669, a monoclonal antibody targeting a highly conserved region of the spike protein in SARS-CoV-2. The company also expects to file a Clinical Trial Application by early Q4 2023 for its anti-TSLP monoclonal antibody in asthma, which is expected to enter clinical trials shortly thereafter.Generate: Biomedicines has a multimodality therapeutic focus with projects in infectious disease, oncology, and immunology. “We've really focused on building a diverse set of expertise, not just in protein design, but also in clinical development and manufacturing,” Gibson says. By integrating various areas of expertise, “we're able to use this technology in ways that impact people,” she added.One key tool is a new cryogenic electron microscopy (CryoEM) facility for generating large-scale structural data to complement the company's in-house protein design machine learning tools and facilitate the drug discovery process. Unveiled in June, this 70,000 square-feet site in Andover, Massachusetts, is among the largest privately owned CryoEM laboratories in the United States.The Baker laboratory is not the only group developing so-called diffusion models, a class of models that leverage generative artificial intelligence (AI), for protein design. The laboratory first posted a preprint on RFdiffusion on bioRxiv last December. 
At the same time, the AI-focused therapeutics company Generate: Biomedicines posted its diffusion model, called Chroma, as a preprint.3 Chroma is one of many platforms offered by the company (Box 1).A month later, the laboratory of Mohammed AlQuraishi, assistant professor of systems biology at Columbia University, posted a preprint on their own diffusion model, Genie.4 “All of these various groups have been thinking about these models at around the same time,” AlQuraishi told GEN Biotechnology. “RFdiffusion works pretty well. Certainly, it's the most well validated among the published methods [at this time].”Although Chroma is not publicly available, the code for Genie is available for public use. AlQuraishi also states that experimental validation for Genie is underway.RFdiffusion is publicly available in a user-friendly online Google Colaboratory notebook.5 Although many experienced scientists are applying RFdiffusion to their protein design efforts and validating their designs in the laboratory, “anyone with a browser” can design a protein that nature has never seen before on the computer…and share it on social media. No coding knowledge is necessary.Mohammed AlQuraishi, Assistant Professor of Systems Biology at Columbia UniversityProteins Beyond NatureBefore the AI revolution, protein design approaches were limited to generating designs based on nature's existing proteins. These standard methods had limitations, as nature has only sampled a small subset of the possible protein landscape, and evolution does not necessarily select for attributes that are desirable from a pharmaceutical or biotechnological standpoint. 
Solubility, stability, ease of production, and low immunogenicity are some of the many characteristics that are crucial from an application and scalability perspective.In contrast, generative AI approaches emphasize de novo protein design—designing new proteins from scratch—with the goal of expanding the repertoire of functions and desirable attributes beyond what nature has achieved. Since the landmark release of AlphaFold6—the acclaimed AI program from Google's DeepMind that made a grand leap in solving one of biology's biggest problems, determining a protein's 3D structure from its sequence—AI-powered protein design has been a rising force promising new possibilities for biotechnological applications.Historically, protein structure prediction and design were time-consuming processes due to low experimental validation rates for computationally derived structures. AI tools such as AlphaFold have allowed prediction of protein structures with unprecedented speed and accuracy, streamlining the research process for drug discovery, industrial applications, and more. In September, the developers of AlphaFold, Demis Hassabis and John Jumper, were among the winners of the 2023 Lasker Awards. 
This prestigious prize recognizes individuals who have made major contributions to medical science.During last month's The State of Biotech—GEN's annual flagship virtual event—renowned UW structural biologist David Baker discussed that the first approved de novo designed medicine, SKYCovione, a COVID vaccine developed by SK Bioscience and the UW Institute for Protein Design (IPD), was approved in June in South Korea for use in adults.“It's an exciting time for protein design!” emphasized Baker, who said the protein design field has seen a major shift from a predominantly biophysical approach, based on the idea that proteins fold to their lowest energy structures, to applying deep learning.Baker, who is a professor in biochemistry, the director of the UW IPD, and a Howard Hughes Medical Institute Investigator, is also a prolific biotech entrepreneur, having cofounded nine companies and serving as a scientific advisor to 18 others, according to the IPD's website.7In Need of a BackboneAs the name suggests, RFdiffusion leverages diffusion models, a generative AI approach that has seen considerable success in image generation tools, such as Midjourney, DALL-E 2 from OpenAI, and Stable Diffusion from Stability AI. These generative models learn the patterns of their training data and generate new outputs with similar characteristics.“We had these neural network architectures that had been useful for protein structure prediction, in particular RoseTTAFold and AlphaFold.6,8 The idea was that maybe these [tools] could be useful for design as well, but it wasn't immediately obvious how to do that in a way that could generate large diversity,” said Brian Trippe, a postdoctoral fellow based in Columbia University and co-advised by Baker. Trippe is another lead author of RFdiffusion.“In a design campaign, you need a bunch of different good options to order and test in the laboratory. 
The community was generally thinking, ‘how do we get real generative AI involved in protein design?’ People were moving toward diffusion models because of how strong they looked in image generation literature,” continued Juergens.The Baker laboratory's structure-forward approach for de novo protein design follows four steps (Fig. 1). First, a backbone, or a protein structure suggested to produce a specific function, is generated. Second, sequences are designed that are predicted to fold into the desired backbone. Third, the sequences from Step 2 are computational filtered for the top candidates that are most likely to succeed in Step 4, experimental validation.FIG. 1. A structure-forward approach for de novo protein design (Credit: Joseph Watson).(1) Generate a structural backbone with a suggested function. (2) Design sequences that are predicted to fold into the backbone from (1). (3) Computationally filter the sequences from (2) for the top candidates predicted to succeed in experimental validation. (4) Experimentally characterize the predicted design.With the rise of machine learning in protein design, many tools have been developed to facilitate Steps 2 and 3 of this workflow, such as ProteinMPNN9 for fixed backbone sequence design and AlphaFold and RoseTTAFold for computationally validating sequence folds, a method known as a self-consistency measure.Methods for diverse and high-throughput backbone generation (Step 1) have remained a bottleneck. 
Diffusion models are a natural fit to address this problem given that they can generate large numbers of diverse outputs, operate directly on amino acid coordinates, and condition on a wide range of inputs with the goal of specifying function.

Experimentally Validated

“We put an enormous amount of effort into working out the mathematical and statistical programming implementation details of this idea and our simulation results were looking just exciting enough to put out a preprint,” said Trippe.

It was one week in December 2022, just before Baker posted the RFdiffusion preprint, when the experimental validation started coming in. “We were hearing [remarks from colleagues] that the binders we wanted to try were actually sticking. It seemed that everything that we had tried was working!” said Trippe.

Helen Eisenach, a graduate student in the Baker laboratory and another co-lead author of RFdiffusion, emphasizes that one major advantage of RFdiffusion is the improved rate at which hypotheses are generated and tested in the laboratory.

“Not only are you generating really good hypotheses in the computer, but then we're seeing that a lot of them are passing. Your rate of finding a successful design is just astronomically higher than a lot of other previous methods,” said Eisenach. “You have both speed on the compute side but also a lot higher accuracy on the experimental side.”

She also notes that RFdiffusion is a step forward from previous backbone generation approaches from the Baker laboratory, such as “hallucination,” which produces good designs but has a slower design process, and “inpainting,” which provides quick generation but lacks diversity in output.10

Columbia's AlQuraishi concurs that one of the key advantages of diffusion models for protein design lies in the improved experimental success rates.

“Prior to the ‘diffusion evolution’, the success rates were probably on the order of maybe 1 to 10,000, if you're lucky,” said AlQuraishi.
“With diffusion models, the success rates are closer to the single percentages when you get into the laboratory. They're still not great but it's a huge magnitude improvement of what it used to be and it's been a really big deal.”

AlQuraishi also indicates that these improved success rates are related to key conceptual differences that diffusion-based design brings to the table. Before diffusion, the lowest energy search approach to protein design, he says, was “an uninformative way to propose sequences. The vast majority of proposals are not correct and you're hoping that if you sample enough times, you'll happen upon something that works by chance.”

“With diffusion models, it's more of a direct thing. You're not generating many hypotheses and then evaluating them in an uninformative way until, by luck, you happen onto something that works. The diffusion models take you to something that already works,” AlQuraishi continued.

Learning a New Language

Structure-forward approaches, such as diffusion models, are not the only machine learning tools infiltrating the protein design field. Large language models that take advantage of protein sequencing data are another rising tool. Earlier this year, researchers from Salesforce Research published ProGen, a protein language model trained on millions of raw protein sequences that generates artificial proteins across multiple families and functions.11

Juergens notes that an advantage of sequence-based design is the ability to model function that cannot be described by a static structure.

“At the IPD, we have a very structure-forward thought process because many functions can be described by one or a few well-defined structures. There are some molecular functions that have a flexing of a molecule or a shifting of a domain.
Language models that reason in sequence space have the capability of implicitly modeling those functions that are described by more than just a single structure,” said Juergens.

“A good analogy is that for large language models, [researchers] used to think that you had to model how reasoning worked. You had to do neuro-symbolic reasoning, which is this emerging behavior of how intelligence is formed. [Researchers later] found out that if you just do a very simple task, you can actually model very complex structures in human reasoning,” said Jason Yim, a graduate student at MIT and another co-lead author of RFdiffusion.

“The idea here is that if you train on all the sequence data for proteins, you can learn something about the higher order of how the proteins operate,” Yim continued. Another factor driving the advance of language models in protein design, he says, is the wealth of sequence data in comparison to structural data. “In structure, it can take a graduate student an entire PhD to solve a single structure, whereas we can generate sequences at a much faster rate with the current sequencing technology. There's so much data that's underutilized for design.”

Helen Eisenach, David Juergens, Brian Trippe and Jason Yim (left to right) are co-lead authors of RFdiffusion who joined GEN Biotechnology for an interview to discuss the newest advances in protein design.

The Ultimate Convergence

Although the combination of sequence-based and structure-based machine learning approaches has opened new avenues for protein design, additional growth is on the horizon. AlQuraishi indicates that more work is needed for these models to accommodate complex instructions to generate structures for biotechnological applications.

“An exciting frontier is conditional generation, where you don't just care about an end generation [that can be experimentally validated]. You care about specifying certain properties that are desirable for your protein,” said AlQuraishi.
“Designing structures with fairly simple constraints [is currently achievable], but something more sophisticated, such as drug discovery, requires more complex constraints.”

Building on this complexity, Juergens highlights that the field is also excited about higher order molecular functions that are not limited to protein atoms. “Trying to simulate molecular properties and functions that require, say, a protein interacting with a metal, small molecule, or RNA or DNA is a growing area of focus for both structure prediction and design,” he said.

Overall, the potential for AI tools to infiltrate all aspects of biotechnology continues to be boundless.

“I could certainly see a point in time where [natural language models, such as ChatGPT,] are able to summarize papers and draw insights from the literature,” AlQuraishi remarks. “At some point, we may be able to interface molecular language models with natural language models, such that we can tell ChatGPT, ‘design your protein with those properties’, and then it goes and gives you a sequence.” That's some way off, “but that would be the ultimate convergence.”

What new protein design barriers will AI break next? Only time will tell.

References

1. Watson JL, Juergens D, Bennett NR, et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620(7976):1089–1100; doi: 10.1038/s41586-023-06415-8
2. Torres SV, Leung PJY, Lutz ID, et al. De novo design of high-affinity protein binders to bioactive helical peptides. bioRxiv 2022;2022.12.10.519862; doi: 10.1101/2022.12.10.519862
3. Ingraham J, Baranov M, Costello Z, et al. Illuminating protein space with a programmable generative model. bioRxiv 2022;2022.12.01.518682; doi: 10.1101/2022.12.01.518682
4. Lin Y, AlQuraishi M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds.
arXiv 2023; doi: 10.48550/arXiv.2301.12485
5. RFdiffusion. Available from: https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb [Last accessed: September 18, 2023].
6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–589; doi: 10.1038/s41586-021-03819-2
7. UW Institute for Protein Design. David Baker's Technology Transfer and Advisory Roles. 2023. Available from: https://www.ipd.uw.edu/baker-technology-transfer-roles/ [Last accessed: September 18, 2023].
8. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373(6557):871–876; doi: 10.1126/science.abj8754
9. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378(6615):49–56; doi: 10.1126/science.add2187
10. Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science 2022;377(6604):387–394; doi: 10.1126/science.abn2100
11. Madani A, Krause B, Greene ER, et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 2023;41(8):1099–1106; doi: 10.1038/s41587-022-01618-2

Copyright 2023, Mary Ann Liebert, Inc., publishers

To cite this article: Fay Lin. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. GEN Biotechnology. Oct 2023. 333-337. http://doi.org/10.1089/genbio.2023.29114.fli

Published in Volume: 2 Issue 5: October 16, 2023