{"title":"Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design","authors":"Fay Lin","doi":"10.1089/genbio.2023.29114.fli","DOIUrl":null,"url":null,"abstract":"GEN Biotechnology, Vol. 2, No. 5. News Features. Free Access. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. Fay Lin, Senior Editor, GEN Biotechnology. Published Online: 16 Oct 2023. https://doi.org/10.1089/genbio.2023.29114.fli Diffusion models, a form of generative artificial intelligence, are a rising tool for protein design, showing improved experimental success and new potential for biotechnological applications. [Image: This protein fold is one of thousands designed from scratch using new machine learning methods. (Credit: Ian C. Haydon/UW Institute for Protein Design)] In July 2023, scientists in David Baker's laboratory at the University of Washington (UW) published a report in Nature detailing a new deep-learning framework for de novo protein design called RoseTTAFold diffusion (RFdiffusion) [1]. Since then, the scientific community has been buzzing about RFdiffusion's unprecedented experimental success rate and ease of use. David Juergens, a graduate student in Baker's laboratory and one of seven co-lead authors of the Nature article, shared an anecdote about a scientist working in a lab in China, who posted on social media how “they designed a protein in a browser, ordered the sequence, purified the protein, crystallized it, and then got a crystal structure that was half an angstrom away from the design that was on the computer.
It was amazing!” Juergens told me. [Photo: David Baker, Professor in Biochemistry and Director of the Institute for Protein Design at UW.] Some of the applications of RFdiffusion, documented with experimental validation in the Nature article, include design of symmetric oligomers for vaccine platforms and delivery vehicles and generation of high-affinity binders for therapeutics [1]. In another project, the Baker laboratory has applied RFdiffusion to design proteins that bind peptide hormones (established biomarkers for clinical care and biomedical research) for diagnostic applications [2]. Box 1. Let's Generate interactions. Generate: Biomedicines is a Boston-based therapeutics company at the intersection of machine learning, biological engineering, and medicine. Molly Gibson, cofounder and chief strategy and innovation officer, says the company focuses on designing protein–protein interactions for therapeutic applications. “If you think about biologics, the most important function that a protein takes is creating very specific and potent binding with its target. This could be things like an antibody where we know exactly where we want to neutralize a target, or where we want to agonize and potentiate function,” said Gibson. One project at Generate: Biomedicines has worked to create a broadly neutralizing antibody for coronavirus.
Gibson notes that the virus actively mutates on the epitope targeted by biologics, leading to many COVID therapeutics losing emergency use authorization (EUA). “We know that there are some parts of the virus that just don't mutate, but interestingly, our immune systems and the immune system of animals where we traditionally get antibodies don't create antibodies commonly against the non-mutating part of the virus,” Gibson continued. She adds that targeting these nonmutating areas makes the therapeutic less likely to be made ineffective by future virus mutations. In September, Generate: Biomedicines announced its first clinical trial for GB-0669, a monoclonal antibody targeting a highly conserved region of the spike protein in SARS-CoV-2. The company also expects to file a Clinical Trial Application by early Q4 2023 for its anti-TSLP monoclonal antibody in asthma, which is expected to enter clinical trials shortly thereafter. Generate: Biomedicines has a multimodality therapeutic focus with projects in infectious disease, oncology, and immunology. “We've really focused on building a diverse set of expertise, not just in protein design, but also in clinical development and manufacturing,” Gibson says. By integrating various areas of expertise, “we're able to use this technology in ways that impact people,” she added. One key tool is a new cryogenic electron microscopy (CryoEM) facility for generating large-scale structural data to complement the company's in-house protein design machine learning tools and facilitate the drug discovery process. Unveiled in June, this 70,000-square-foot site in Andover, Massachusetts, is among the largest privately owned CryoEM laboratories in the United States. The Baker laboratory is not the only group developing so-called diffusion models, a class of models that leverage generative artificial intelligence (AI), for protein design. The laboratory first posted a preprint on RFdiffusion on bioRxiv in December 2022.
At the same time, the AI-focused therapeutics company Generate: Biomedicines posted its diffusion model, called Chroma, as a preprint [3]. Chroma is one of many platforms offered by the company (Box 1). A month later, the laboratory of Mohammed AlQuraishi, assistant professor of systems biology at Columbia University, posted a preprint on their own diffusion model, Genie [4]. “All of these various groups have been thinking about these models at around the same time,” AlQuraishi told GEN Biotechnology. “RFdiffusion works pretty well. Certainly, it's the most well validated among the published methods [at this time].” Although Chroma is not publicly available, the code for Genie is available for public use. AlQuraishi also states that experimental validation for Genie is underway. RFdiffusion is publicly available in a user-friendly online Google Colaboratory notebook [5]. Although many experienced scientists are applying RFdiffusion to their protein design efforts and validating their designs in the laboratory, “anyone with a browser” can design a protein that nature has never seen before on the computer…and share it on social media. No coding knowledge is necessary. [Photo: Mohammed AlQuraishi, Assistant Professor of Systems Biology at Columbia University.] Proteins Beyond Nature. Before the AI revolution, protein design approaches were limited to generating designs based on nature's existing proteins. These standard methods had limitations, as nature has only sampled a small subset of the possible protein landscape, and evolution does not necessarily select for attributes that are desirable from a pharmaceutical or biotechnological standpoint.
Solubility, stability, ease of production, and low immunogenicity are some of the many characteristics that are crucial from an application and scalability perspective. In contrast, generative AI approaches emphasize de novo protein design (designing new proteins from scratch) with the goal of expanding the repertoire of functions and desirable attributes beyond what nature has achieved. Since the landmark release of AlphaFold [6], the acclaimed AI program from Google's DeepMind that made a grand leap in solving one of biology's biggest problems (determining a protein's 3D structure from its sequence), AI-powered protein design has been a rising force promising new possibilities for biotechnological applications. Historically, protein structure prediction and design were time-consuming processes due to low experimental validation rates for computationally derived structures. AI tools such as AlphaFold have allowed prediction of protein structures with unprecedented speed and accuracy, streamlining the research process for drug discovery, industrial applications, and more. In September, the developers of AlphaFold, Demis Hassabis and John Jumper, were among the winners of the 2023 Lasker Awards.
This prestigious prize recognizes individuals who have made major contributions to medical science. During last month's The State of Biotech (GEN's annual flagship virtual event), renowned UW structural biologist David Baker noted that the first approved de novo designed medicine, SKYCovione, a COVID vaccine developed by SK Bioscience and the UW Institute for Protein Design (IPD), was approved in June 2022 in South Korea for use in adults. “It's an exciting time for protein design!” emphasized Baker, who said the protein design field has seen a major shift from a predominantly biophysical approach, based on the idea that proteins fold to their lowest energy structures, to applying deep learning. Baker, who is a professor in biochemistry, the director of the UW IPD, and a Howard Hughes Medical Institute Investigator, is also a prolific biotech entrepreneur, having cofounded nine companies and serving as a scientific advisor to 18 others, according to the IPD's website [7]. In Need of a Backbone. As the name suggests, RFdiffusion leverages diffusion models, a generative AI approach that has seen considerable success in image generation tools, such as Midjourney, DALL-E 2 from OpenAI, and Stable Diffusion from Stability AI. These generative models learn the patterns of their training data and generate new outputs with similar characteristics. “We had these neural network architectures that had been useful for protein structure prediction, in particular RoseTTAFold and AlphaFold [6,8]. The idea was that maybe these [tools] could be useful for design as well, but it wasn't immediately obvious how to do that in a way that could generate large diversity,” said Brian Trippe, a postdoctoral fellow based at Columbia University and co-advised by Baker. Trippe is another lead author of RFdiffusion. “In a design campaign, you need a bunch of different good options to order and test in the laboratory.
The community was generally thinking, ‘how do we get real generative AI involved in protein design?’ People were moving toward diffusion models because of how strong they looked in image generation literature,” continued Juergens. The Baker laboratory's structure-forward approach for de novo protein design follows four steps (Fig. 1). First, a backbone, or a protein structure suggested to produce a specific function, is generated. Second, sequences are designed that are predicted to fold into the desired backbone. Third, the sequences from Step 2 are computationally filtered for the top candidates that are most likely to succeed in Step 4, experimental validation. FIG. 1. A structure-forward approach for de novo protein design (Credit: Joseph Watson). (1) Generate a structural backbone with a suggested function. (2) Design sequences that are predicted to fold into the backbone from (1). (3) Computationally filter the sequences from (2) for the top candidates predicted to succeed in experimental validation. (4) Experimentally characterize the predicted design. With the rise of machine learning in protein design, many tools have been developed to facilitate Steps 2 and 3 of this workflow, such as ProteinMPNN [9] for fixed backbone sequence design and AlphaFold and RoseTTAFold for computationally validating sequence folds, a method known as a self-consistency measure. Methods for diverse and high-throughput backbone generation (Step 1) have remained a bottleneck.
Diffusion models are a natural fit to address this problem given that they can generate large numbers of diverse outputs, operate directly on amino acid coordinates, and condition on a wide range of inputs with the goal of specifying function. Experimentally Validated. “We put an enormous amount of effort into working out the mathematical and statistical programming implementation details of this idea and our simulation results were looking just exciting enough to put out a preprint,” said Trippe. It was one week in December 2022, just before Baker posted the RFdiffusion preprint, when the experimental validation started coming in. “We were hearing [remarks from colleagues] that the binders we wanted to try were actually sticking. It seemed that everything that we had tried was working!” said Trippe. Helen Eisenach, a graduate student in the Baker laboratory and another co-lead author of RFdiffusion, emphasizes that one major advantage of RFdiffusion is the improved rate at which hypotheses are generated and tested in the laboratory. “Not only are you generating really good hypotheses in the computer, but then we're seeing that a lot of them are passing. Your rate of finding a successful design is just astronomically higher than a lot of other previous methods,” said Eisenach. “You have both speed on the compute side but also a lot higher accuracy on the experimental side.” She also notes that RFdiffusion is a step forward from previous backbone generation approaches from the Baker laboratory, such as “hallucination,” which performs well to produce designs but has a slower design process, and “inpainting,” which provides quick generation but lacks diversity in output [10]. Columbia's AlQuraishi concurs that one of the key advantages of diffusion models for protein design lies in the improved experimental success rates. “Prior to the ‘diffusion evolution’, the success rates were probably on the order of maybe 1 to 10,000, if you're lucky,” said AlQuraishi.
“With diffusion models, the success rates are closer to the single percentages when you get into the laboratory. They're still not great but it's a huge magnitude improvement of what it used to be and it's been a really big deal.” AlQuraishi also indicates that these improved success rates are related to key conceptual differences that diffusion-based design brings to the table. Before diffusion, the lowest energy search approach to protein design, he says, was “an uninformative way to propose sequences. The vast majority of proposals are not correct and you're hoping that if you sample enough times, you'll happen upon something that works by chance.” “With diffusion models, it's more of a direct thing. You're not generating many hypotheses and then evaluating them in an uninformative way until, by luck, you happen onto something that works. The diffusion models take you to something that already works,” AlQuraishi continued. Learning a New Language. Structure-forward approaches, such as diffusion models, are not the only machine learning tools infiltrating the protein design field. Large language models that take advantage of protein sequencing data are another rising tool. Earlier this year, researchers from Salesforce Research published ProGen, a protein language model trained on millions of raw protein sequences that generates artificial proteins across multiple families and functions [11]. Juergens notes that an advantage of sequence-based design is the ability to model function that cannot be described by a static structure. “At the IPD, we have a very structure-forward thought process because many functions can be described by one or a few well-defined structures. There are some molecular functions that have a flexing of a molecule or a shifting of a domain.
Language models that reason in sequence space have the capability of implicitly modeling those functions that are described by more than just a single structure,” said Juergens. “A good analogy is that for large language models, [researchers] used to think that you had to model how reasoning worked. You had to do neuro symbolic reasoning, which is this emerging behavior of how intelligence is formed. [Researchers later] found out that if you just do a very simple task, you can actually model very complex structures in human reasoning,” said Jason Yim, a graduate student at MIT and another co-lead author of RFdiffusion. “The idea here is that if you train on all the sequence data for proteins, you can learn something about the higher order of how the proteins operate,” Yim continued. Another factor he says driving the advance of language models in protein design is the wealth of sequence data in comparison to structural data. “In structure, it can take a graduate student an entire PhD to solve a single structure, whereas we can generate sequences at a much faster rate with the current sequencing technology. There's so much data that's underutilized for design.” [Photo: Helen Eisenach, David Juergens, Brian Trippe and Jason Yim (left to right) are co-lead authors of RFdiffusion who joined GEN Biotechnology for an interview to discuss the newest advances in protein design.] The Ultimate Convergence. Although the combination of sequence-based and structure-based machine learning approaches has opened new avenues for protein design, additional growth is on the horizon. AlQuraishi indicates that more work is needed for these models to accommodate complex instructions to generate structures for biotechnological applications. “An exciting frontier is conditional generation, where you don't just care about an end generation [that can be experimentally validated]. You care about specifying certain properties that are desirable for your protein,” said AlQuraishi.
“Designing structures with fairly simple constraints [is currently achievable], but something more sophisticated, such as drug discovery, requires more complex constraints.” Building on this complexity, Juergens highlights that the field is also excited about higher order molecular functions that are not limited to protein atoms. “Trying to simulate molecular properties and functions that require, say a protein interacting with a metal, small molecule, or RNA or DNA is a growing area of focus for both structure prediction and design,” he said. Overall, the potential for AI tools to infiltrate all aspects of biotechnology continues to be boundless. “I could certainly see a point in time where [natural language models, such as ChatGPT,] are able to summarize papers and draw insights from the literature,” AlQuraishi remarks. “At some point, we may be able to interface molecular language models with natural language models, such that we can tell ChatGPT, ‘design your protein with those properties’, and then it goes and gives you a sequence.” That's some way off, “but that would be the ultimate convergence.” What new protein design barriers will AI break next? Only time will tell. References. 1. Watson JL, Juergens D, Bennett NR, et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620(7976):1089–1100; doi: 10.1038/s41586-023-06415-8. 2. Torres SV, Leung PJY, Lutz ID, et al. De novo design of high-affinity protein binders to bioactive helical peptides. bioRxiv 2022;2022.12.10.519862; doi: 10.1101/2022.12.10.519862. 3. Ingraham J, Baranov M, Costello Z, et al. Illuminating protein space with a programmable generative model. bioRxiv 2022;2022.12.01.518682; doi: 10.1101/2022.12.01.518682. 4. Lin Y, AlQuraishi M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds.
arXiv 2023; doi: 10.48550/arXiv.2301.12485. 5. RFdiffusion. Available from: https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb [Last accessed: September 18, 2023]. 6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–589; doi: 10.1038/s41586-021-03819-2. 7. UW Institute for Protein Design. David Baker's Technology Transfer and Advisory Roles. 2023. Available from: https://www.ipd.uw.edu/baker-technology-transfer-roles/ [Last accessed: September 18, 2023]. 8. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373(6557):871–876; doi: 10.1126/science.abj8754. 9. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378(6615):49–56; doi: 10.1126/science.add2187. 10. Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science 2022;377(6604):387–394; doi: 10.1126/science.abn2100. 11. Madani A, Krause B, Greene ER, et al. Large language models generate functional protein sequences across diverse families.
Nat Biotechnol 2023;41(8):1099–1106; doi: 10.1038/s41587-022-01618-2. Copyright 2023, Mary Ann Liebert, Inc., publishers. To cite this article: Fay Lin. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. GEN Biotechnology. Oct 2023. 333-337. http://doi.org/10.1089/genbio.2023.29114.fli","PeriodicalId":73134,"journal":{"name":"GEN biotechnology","volume":"37 1","pages":"0"},"PeriodicalIF":2.0000,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design\",\"authors\":\"Fay Lin\",\"doi\":\"10.1089/genbio.2023.29114.fli\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"GEN BiotechnologyVol. 2, No. 5 News FeaturesFree AccessDiffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein DesignFay LinFay LinE-mail Address: [email protected]Senior Editor, GEN BiotechnologySearch for more papers by this authorPublished Online:16 Oct 2023https://doi.org/10.1089/genbio.2023.29114.fliAboutSectionsPDF/EPUB Permissions & CitationsPermissionsDownload CitationsTrack CitationsAdd to favorites Back To Publication ShareShare onFacebookTwitterLinked InRedditEmail Diffusion models, a form of generative artificial intelligence, are a rising tool for protein design, showing improved experimental success and new potential for biotechnological applications.This protein fold is one of thousands designed from scratch using new machine learning methods. (Credit: Ian C. 
Haydon/UW Institute for Protein Design)In July 2023, scientists in David Baker's laboratory at the University of Washington (UW) published a report in Nature detailing a new deep-learning framework for de novo protein design called RoseTTAFold diffusion (RFdiffusion), in Nature.1 Since then, the scientific community has been buzzing about RFdiffusion's unprecedented experimental success rate and ease of use.David Juergens, a graduate student in Baker's laboratory and one of seven co-lead authors of the Nature article, shared an anecdote about a scientist working in a lab in China, who posted on social media how “they designed a protein in a browser, ordered the sequence, purified the protein, crystallized it, and then got a crystal structure that was half an angstrom away from the design that was on the computer. It was amazing!” Juergens told me.David Baker, Professor in Biochemistry and Director of the Institute for Protein Design at UWSome of the applications of RFdiffusion, documented with experimental validation in the Nature article, include design of symmetric oligomers for vaccine platforms and delivery vehicles and generation of high-affinity binders for therapeutics.1 In another project, the Baker laboratory has applied RFdiffusion to design proteins that bind peptide hormones—established biomarkers for clinical care and biomedical research—for diagnostic applications.2Box 1. Let's Generate interactionsGenerate: Biomedicines is a Boston-based therapeutics company at the intersection of machine learning, biological engineering, and medicine. Molly Gibson, cofounder and chief strategy and innovation officer, says the company focuses on designing protein–protein interactions for therapeutic applications.“If you think about biologics, the most important function that a protein takes is creating very specific and potent binding with its target. 
This could be things like an antibody where we know exactly where we want to neutralize a target, or where we want to agonize and potentiate function,” said Gibson.One project at Generate: Biomedicines has worked to create a broadly neutralizing antibody for coronavirus. Gibson notes that the virus actively mutates on the epitope targeted by biologics, leading to many COVID therapeutics losing emergency use authorization (EUA).“We know that there are some parts of the virus that just don't mutate, but interestingly, our immune systems and the immune system of animals where we traditionally get antibodies don't create antibodies commonly against the non-mutating part of the virus,” Gibson continued.She adds that targeting these nonmutating areas makes the therapeutic less likely to be made ineffective by future virus mutations.In September, Generate: Biomedicines announced their first clinical trial for GB-0669, a monoclonal antibody targeting a highly conserved region of the spike protein in SARS-CoV-2. The company also expects to file a Clinical Trial Application by early Q4 2023 for its anti-TSLP monoclonal antibody in asthma, which is expected to enter clinical trials shortly thereafter.Generate: Biomedicines has a multimodality therapeutic focus with projects in infectious disease, oncology, and immunology. “We've really focused on building a diverse set of expertise, not just in protein design, but also in clinical development and manufacturing,” Gibson says. By integrating various areas of expertise, “we're able to use this technology in ways that impact people,” she added.One key tool is a new cryogenic electron microscopy (CryoEM) facility for generating large-scale structural data to complement the company's in-house protein design machine learning tools and facilitate the drug discovery process. 
Unveiled in June, this 70,000 square-feet site in Andover, Massachusetts, is among the largest privately owned CryoEM laboratories in the United States.The Baker laboratory is not the only group developing so-called diffusion models, a class of models that leverage generative artificial intelligence (AI), for protein design. The laboratory first posted a preprint on RFdiffusion on bioRxiv last December. At the same time, the AI-focused therapeutics company Generate: Biomedicines posted its diffusion model, called Chroma, as a preprint.3 Chroma is one of many platforms offered by the company (Box 1).A month later, the laboratory of Mohammed AlQuraishi, assistant professor of systems biology at Columbia University, posted a preprint on their own diffusion model, Genie.4 “All of these various groups have been thinking about these models at around the same time,” AlQuraishi told GEN Biotechnology. “RFdiffusion works pretty well. Certainly, it's the most well validated among the published methods [at this time].”Although Chroma is not publicly available, the code for Genie is available for public use. AlQuraishi also states that experimental validation for Genie is underway.RFdiffusion is publicly available in a user-friendly online Google Colaboratory notebook.5 Although many experienced scientists are applying RFdiffusion to their protein design efforts and validating their designs in the laboratory, “anyone with a browser” can design a protein that nature has never seen before on the computer…and share it on social media. No coding knowledge is necessary.Mohammed AlQuraishi, Assistant Professor of Systems Biology at Columbia UniversityProteins Beyond NatureBefore the AI revolution, protein design approaches were limited to generating designs based on nature's existing proteins. 
These standard methods had limitations, as nature has only sampled a small subset of the possible protein landscape, and evolution does not necessarily select for attributes that are desirable from a pharmaceutical or biotechnological standpoint. Solubility, stability, ease of production, and low immunogenicity are some of the many characteristics that are crucial from an application and scalability perspective.In contrast, generative AI approaches emphasize de novo protein design—designing new proteins from scratch—with the goal of expanding the repertoire of functions and desirable attributes beyond what nature has achieved. Since the landmark release of AlphaFold6—the acclaimed AI program from Google's DeepMind that made a grand leap in solving one of biology's biggest problems, determining a protein's 3D structure from its sequence—AI-powered protein design has been a rising force promising new possibilities for biotechnological applications.Historically, protein structure prediction and design were time-consuming processes due to low experimental validation rates for computationally derived structures. AI tools such as AlphaFold have allowed prediction of protein structures with unprecedented speed and accuracy, streamlining the research process for drug discovery, industrial applications, and more. In September, the developers of AlphaFold, Demis Hassabis and John Jumper, were among the winners of the 2023 Lasker Awards. 
This prestigious prize recognizes individuals who have made major contributions to medical science.During last month's The State of Biotech—GEN's annual flagship virtual event—renowned UW structural biologist David Baker discussed that the first approved de novo designed medicine, SKYCovione, a COVID vaccine developed by SK Bioscience and the UW Institute for Protein Design (IPD), was approved in June in South Korea for use in adults.“It's an exciting time for protein design!” emphasized Baker, who said the protein design field has seen a major shift from a predominantly biophysical approach, based on the idea that proteins fold to their lowest energy structures, to applying deep learning.Baker, who is a professor in biochemistry, the director of the UW IPD, and a Howard Hughes Medical Institute Investigator, is also a prolific biotech entrepreneur, having cofounded nine companies and serving as a scientific advisor to 18 others, according to the IPD's website.7In Need of a BackboneAs the name suggests, RFdiffusion leverages diffusion models, a generative AI approach that has seen considerable success in image generation tools, such as Midjourney, DALL-E 2 from OpenAI, and Stable Diffusion from Stability AI. These generative models learn the patterns of their training data and generate new outputs with similar characteristics.“We had these neural network architectures that had been useful for protein structure prediction, in particular RoseTTAFold and AlphaFold.6,8 The idea was that maybe these [tools] could be useful for design as well, but it wasn't immediately obvious how to do that in a way that could generate large diversity,” said Brian Trippe, a postdoctoral fellow based in Columbia University and co-advised by Baker. Trippe is another lead author of RFdiffusion.“In a design campaign, you need a bunch of different good options to order and test in the laboratory. 
The community was generally thinking, ‘how do we get real generative AI involved in protein design?’ People were moving toward diffusion models because of how strong they looked in image generation literature,” continued Juergens.The Baker laboratory's structure-forward approach for de novo protein design follows four steps (Fig. 1). First, a backbone, or a protein structure suggested to produce a specific function, is generated. Second, sequences are designed that are predicted to fold into the desired backbone. Third, the sequences from Step 2 are computational filtered for the top candidates that are most likely to succeed in Step 4, experimental validation.FIG. 1. A structure-forward approach for de novo protein design (Credit: Joseph Watson).(1) Generate a structural backbone with a suggested function. (2) Design sequences that are predicted to fold into the backbone from (1). (3) Computationally filter the sequences from (2) for the top candidates predicted to succeed in experimental validation. (4) Experimentally characterize the predicted design.With the rise of machine learning in protein design, many tools have been developed to facilitate Steps 2 and 3 of this workflow, such as ProteinMPNN9 for fixed backbone sequence design and AlphaFold and RoseTTAFold for computationally validating sequence folds, a method known as a self-consistency measure.Methods for diverse and high-throughput backbone generation (Step 1) have remained a bottleneck. 
Diffusion models are a natural fit to address this problem given that they can generate large numbers of diverse outputs, operate directly on amino acid coordinates, and condition on a wide range of inputs with the goal of specifying function.

Experimentally Validated

“We put an enormous amount of effort into working out the mathematical and statistical programming implementation details of this idea and our simulation results were looking just exciting enough to put out a preprint,” said Trippe.

It was one week in December 2022, just before Baker posted the RFdiffusion preprint, when the experimental validation started coming in. “We were hearing [remarks from colleagues] that the binders we wanted to try were actually sticking. It seemed that everything that we had tried was working!” said Trippe.

Helen Eisenach, a graduate student in the Baker laboratory and another co-lead author of RFdiffusion, emphasizes that one major advantage of RFdiffusion is the improved rate at which hypotheses are generated and tested in the laboratory.

“Not only are you generating really good hypotheses in the computer, but then we're seeing that a lot of them are passing. Your rate of finding a successful design is just astronomically higher than a lot of other previous methods,” said Eisenach. “You have both speed on the compute side but also a lot higher accuracy on the experimental side.”

She also notes that RFdiffusion is a step forward from previous backbone generation approaches from the Baker laboratory, such as “hallucination,” which produces good designs but is slow, and “inpainting,” which generates designs quickly but lacks diversity in output.10

Columbia's AlQuraishi concurs that one of the key advantages of diffusion models for protein design lies in the improved experimental success rates.

“Prior to the ‘diffusion evolution’, the success rates were probably on the order of maybe 1 to 10,000, if you're lucky,” said AlQuraishi. 
“With diffusion models, the success rates are closer to the single percentages when you get into the laboratory. They're still not great but it's a huge magnitude improvement of what it used to be and it's been a really big deal.”

AlQuraishi also indicates that these improved success rates are related to key conceptual differences that diffusion-based design brings to the table. Before diffusion, the lowest-energy search approach to protein design, he says, was “an uninformative way to propose sequences. The vast majority of proposals are not correct and you're hoping that if you sample enough times, you'll happen upon something that works by chance.”

“With diffusion models, it's more of a direct thing. You're not generating many hypotheses and then evaluating them in an uninformative way until, by luck, you happen onto something that works. The diffusion models take you to something that already works,” AlQuraishi continued.

Learning a New Language

Structure-forward approaches, such as diffusion models, are not the only machine learning tools infiltrating the protein design field. Large language models that take advantage of protein sequencing data are another rising tool. Earlier this year, researchers from Salesforce Research published ProGen, a protein language model trained on millions of raw protein sequences that generates artificial proteins across multiple families and functions.11

Juergens notes that an advantage of sequence-based design is the ability to model function that cannot be described by a static structure.

“At the IPD, we have a very structure-forward thought process because many functions can be described by one or a few well-defined structures. There are some molecular functions that have a flexing of a molecule or a shifting of a domain. 
Language models that reason in sequence space have the capability of implicitly modeling those functions that are described by more than just a single structure,” said Juergens.

“A good analogy is that for large language models, [researchers] used to think that you had to model how reasoning worked. You had to do neuro symbolic reasoning, which is this emerging behavior of how intelligence is formed. [Researchers later] found out that if you just do a very simple task, you can actually model very complex structures in human reasoning,” said Jason Yim, a graduate student at MIT and another co-lead author of RFdiffusion.

“The idea here is that if you train on all the sequence data for proteins, you can learn something about the higher order of how the proteins operate,” Yim continued. Another factor driving the advance of language models in protein design, he says, is the wealth of sequence data in comparison to structural data. “In structure, it can take a graduate student an entire PhD to solve a single structure, whereas we can generate sequences at a much faster rate with the current sequencing technology. There's so much data that's underutilized for design.”

Helen Eisenach, David Juergens, Brian Trippe, and Jason Yim (left to right) are co-lead authors of RFdiffusion who joined GEN Biotechnology for an interview to discuss the newest advances in protein design.

The Ultimate Convergence

Although the combination of sequence-based and structure-based machine learning approaches has opened new avenues for protein design, additional growth is on the horizon. AlQuraishi indicates that more work is needed for these models to accommodate complex instructions to generate structures for biotechnological applications.

“An exciting frontier is conditional generation, where you don't just care about an end generation [that can be experimentally validated]. You care about specifying certain properties that are desirable for your protein,” said AlQuraishi. 
“Designing structures with fairly simple constraints [is currently achievable], but something more sophisticated, such as drug discovery, requires more complex constraints.”

Building on this complexity, Juergens highlights that the field is also excited about higher order molecular functions that are not limited to protein atoms. “Trying to simulate molecular properties and functions that require, say, a protein interacting with a metal, small molecule, or RNA or DNA is a growing area of focus for both structure prediction and design,” he said.

Overall, the potential for AI tools to infiltrate all aspects of biotechnology continues to be boundless.

“I could certainly see a point in time where [natural language models, such as ChatGPT,] are able to summarize papers and draw insights from the literature,” AlQuraishi remarks. “At some point, we may be able to interface molecular language models with natural language models, such that we can tell ChatGPT, ‘design your protein with those properties’, and then it goes and gives you a sequence.” That's some way off, “but that would be the ultimate convergence.”

What new protein design barriers will AI break next? Only time will tell.

References

1. Watson JL, Juergens D, Bennett NR, et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620(7976):1089–1100; doi: 10.1038/s41586-023-06415-8
2. Torres SV, Leung PJY, Lutz ID, et al. De novo design of high-affinity protein binders to bioactive helical peptides. bioRxiv 2022;2022.12.10.519862; doi: 10.1101/2022.12.10.519862
3. Ingraham J, Baranov M, Costello Z, et al. Illuminating protein space with a programmable generative model. bioRxiv 2022;2022.12.01.518682; doi: 10.1101/2022.12.01.518682
4. Lin Y, AlQuraishi M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. 
arXiv 2023; doi: 10.48550/arXiv.2301.12485
5. RFdiffusion. Available from: https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb [Last accessed: September 18, 2023].
6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–589; doi: 10.1038/s41586-021-03819-2
7. Institute for Protein Design. David Baker's Technology Transfer and Advisory Roles. 2023. Available from: https://www.ipd.uw.edu/baker-technology-transfer-roles/ [Last accessed: September 18, 2023].
8. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373(6557):871–876; doi: 10.1126/science.abj8754
9. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378(6615):49–56; doi: 10.1126/science.add2187
10. Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science 2022;377(6604):387–394; doi: 10.1126/science.abn2100
11. Madani A, Krause B, Greene ER, et al. Large language models generate functional protein sequences across diverse families. 
Nat Biotechnol 2023;41(8):1099–1106; doi: 10.1038/s41587-022-01618-2

Volume 2, Issue 5, Oct 2023. Copyright 2023, Mary Ann Liebert, Inc., publishers. To cite this article: Fay Lin. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. GEN Biotechnology. Oct 2023. 333-337. http://doi.org/10.1089/genbio.2023.29114.fli Published in Volume: 2 Issue 5: October 16, 2023\",\"PeriodicalId\":73134,\"journal\":{\"name\":\"GEN biotechnology\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2023-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"GEN biotechnology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1089/genbio.2023.29114.fli\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"BIOTECHNOLOGY & APPLIED MICROBIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"GEN biotechnology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1089/genbio.2023.29114.fli","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
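The article describes Step 3 of the structure-forward workflow as computational filtering with a self-consistency measure: a structure is predicted for each designed sequence and compared with the backbone it was designed for. The following is a minimal sketch of that idea in Python, not the Baker laboratory's implementation. The names `Candidate`, `rmsd`, and `self_consistency_filter`, the 2.0 angstrom cutoff, and the `top_k` limit are all illustrative assumptions that do not come from the article, and the sketch assumes the two coordinate sets are already superimposed.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Candidate:
    """Hypothetical container for one designed sequence and its two structures."""
    name: str
    designed_backbone: List[Coord]   # Step 1 output, e.g. C-alpha positions in angstroms
    predicted_backbone: List[Coord]  # structure predicted for the Step 2 sequence


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two equally long coordinate lists.

    Assumes the structures are already superimposed; no alignment is performed.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def self_consistency_filter(candidates, max_rmsd=2.0, top_k=10):
    """Keep candidates whose predicted structure stays close to the design.

    Returns (rmsd, candidate) pairs sorted from best to worst.
    The cutoff and top_k defaults are assumed values for illustration only.
    """
    scored = [(rmsd(c.designed_backbone, c.predicted_backbone), c) for c in candidates]
    passing = [pair for pair in scored if pair[0] <= max_rmsd]
    passing.sort(key=lambda pair: pair[0])  # sort on the score only
    return passing[:top_k]


if __name__ == "__main__":
    # Toy data: three points on a line, with predictions shifted sideways.
    design = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    close = [(x, y + 0.5, z) for x, y, z in design]
    far = [(x, y + 5.0, z) for x, y, z in design]
    pool = [Candidate("far", design, far), Candidate("close", design, close)]
    for score, cand in self_consistency_filter(pool):
        print(f"{cand.name}: {score:.2f} angstroms")
```

In a real campaign the predicted coordinates would come from a structure predictor such as AlphaFold or RoseTTAFold, and the structures would be aligned before comparison; both steps are left out here to keep the sketch self-contained.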
Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design
GEN BiotechnologyVol. 2, No. 5 News FeaturesFree AccessDiffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein DesignFay LinFay LinE-mail Address: [email protected]Senior Editor, GEN BiotechnologySearch for more papers by this authorPublished Online:16 Oct 2023https://doi.org/10.1089/genbio.2023.29114.fliAboutSectionsPDF/EPUB Permissions & CitationsPermissionsDownload CitationsTrack CitationsAdd to favorites Back To Publication ShareShare onFacebookTwitterLinked InRedditEmail Diffusion models, a form of generative artificial intelligence, are a rising tool for protein design, showing improved experimental success and new potential for biotechnological applications.This protein fold is one of thousands designed from scratch using new machine learning methods. (Credit: Ian C. Haydon/UW Institute for Protein Design)In July 2023, scientists in David Baker's laboratory at the University of Washington (UW) published a report in Nature detailing a new deep-learning framework for de novo protein design called RoseTTAFold diffusion (RFdiffusion), in Nature.1 Since then, the scientific community has been buzzing about RFdiffusion's unprecedented experimental success rate and ease of use.David Juergens, a graduate student in Baker's laboratory and one of seven co-lead authors of the Nature article, shared an anecdote about a scientist working in a lab in China, who posted on social media how “they designed a protein in a browser, ordered the sequence, purified the protein, crystallized it, and then got a crystal structure that was half an angstrom away from the design that was on the computer. 
It was amazing!” Juergens told me.David Baker, Professor in Biochemistry and Director of the Institute for Protein Design at UWSome of the applications of RFdiffusion, documented with experimental validation in the Nature article, include design of symmetric oligomers for vaccine platforms and delivery vehicles and generation of high-affinity binders for therapeutics.1 In another project, the Baker laboratory has applied RFdiffusion to design proteins that bind peptide hormones—established biomarkers for clinical care and biomedical research—for diagnostic applications.2Box 1. Let's Generate interactionsGenerate: Biomedicines is a Boston-based therapeutics company at the intersection of machine learning, biological engineering, and medicine. Molly Gibson, cofounder and chief strategy and innovation officer, says the company focuses on designing protein–protein interactions for therapeutic applications.“If you think about biologics, the most important function that a protein takes is creating very specific and potent binding with its target. This could be things like an antibody where we know exactly where we want to neutralize a target, or where we want to agonize and potentiate function,” said Gibson.One project at Generate: Biomedicines has worked to create a broadly neutralizing antibody for coronavirus. 
Gibson notes that the virus actively mutates on the epitope targeted by biologics, leading to many COVID therapeutics losing emergency use authorization (EUA).“We know that there are some parts of the virus that just don't mutate, but interestingly, our immune systems and the immune system of animals where we traditionally get antibodies don't create antibodies commonly against the non-mutating part of the virus,” Gibson continued.She adds that targeting these nonmutating areas makes the therapeutic less likely to be made ineffective by future virus mutations.In September, Generate: Biomedicines announced their first clinical trial for GB-0669, a monoclonal antibody targeting a highly conserved region of the spike protein in SARS-CoV-2. The company also expects to file a Clinical Trial Application by early Q4 2023 for its anti-TSLP monoclonal antibody in asthma, which is expected to enter clinical trials shortly thereafter.Generate: Biomedicines has a multimodality therapeutic focus with projects in infectious disease, oncology, and immunology. “We've really focused on building a diverse set of expertise, not just in protein design, but also in clinical development and manufacturing,” Gibson says. By integrating various areas of expertise, “we're able to use this technology in ways that impact people,” she added.One key tool is a new cryogenic electron microscopy (CryoEM) facility for generating large-scale structural data to complement the company's in-house protein design machine learning tools and facilitate the drug discovery process. Unveiled in June, this 70,000 square-feet site in Andover, Massachusetts, is among the largest privately owned CryoEM laboratories in the United States.The Baker laboratory is not the only group developing so-called diffusion models, a class of models that leverage generative artificial intelligence (AI), for protein design. The laboratory first posted a preprint on RFdiffusion on bioRxiv last December. 
At the same time, the AI-focused therapeutics company Generate: Biomedicines posted its diffusion model, called Chroma, as a preprint.3 Chroma is one of many platforms offered by the company (Box 1).A month later, the laboratory of Mohammed AlQuraishi, assistant professor of systems biology at Columbia University, posted a preprint on their own diffusion model, Genie.4 “All of these various groups have been thinking about these models at around the same time,” AlQuraishi told GEN Biotechnology. “RFdiffusion works pretty well. Certainly, it's the most well validated among the published methods [at this time].”Although Chroma is not publicly available, the code for Genie is available for public use. AlQuraishi also states that experimental validation for Genie is underway.RFdiffusion is publicly available in a user-friendly online Google Colaboratory notebook.5 Although many experienced scientists are applying RFdiffusion to their protein design efforts and validating their designs in the laboratory, “anyone with a browser” can design a protein that nature has never seen before on the computer…and share it on social media. No coding knowledge is necessary.Mohammed AlQuraishi, Assistant Professor of Systems Biology at Columbia UniversityProteins Beyond NatureBefore the AI revolution, protein design approaches were limited to generating designs based on nature's existing proteins. These standard methods had limitations, as nature has only sampled a small subset of the possible protein landscape, and evolution does not necessarily select for attributes that are desirable from a pharmaceutical or biotechnological standpoint. 
Solubility, stability, ease of production, and low immunogenicity are some of the many characteristics that are crucial from an application and scalability perspective.In contrast, generative AI approaches emphasize de novo protein design—designing new proteins from scratch—with the goal of expanding the repertoire of functions and desirable attributes beyond what nature has achieved. Since the landmark release of AlphaFold6—the acclaimed AI program from Google's DeepMind that made a grand leap in solving one of biology's biggest problems, determining a protein's 3D structure from its sequence—AI-powered protein design has been a rising force promising new possibilities for biotechnological applications.Historically, protein structure prediction and design were time-consuming processes due to low experimental validation rates for computationally derived structures. AI tools such as AlphaFold have allowed prediction of protein structures with unprecedented speed and accuracy, streamlining the research process for drug discovery, industrial applications, and more. In September, the developers of AlphaFold, Demis Hassabis and John Jumper, were among the winners of the 2023 Lasker Awards. 
This prestigious prize recognizes individuals who have made major contributions to medical science.During last month's The State of Biotech—GEN's annual flagship virtual event—renowned UW structural biologist David Baker discussed that the first approved de novo designed medicine, SKYCovione, a COVID vaccine developed by SK Bioscience and the UW Institute for Protein Design (IPD), was approved in June in South Korea for use in adults.“It's an exciting time for protein design!” emphasized Baker, who said the protein design field has seen a major shift from a predominantly biophysical approach, based on the idea that proteins fold to their lowest energy structures, to applying deep learning.Baker, who is a professor in biochemistry, the director of the UW IPD, and a Howard Hughes Medical Institute Investigator, is also a prolific biotech entrepreneur, having cofounded nine companies and serving as a scientific advisor to 18 others, according to the IPD's website.7In Need of a BackboneAs the name suggests, RFdiffusion leverages diffusion models, a generative AI approach that has seen considerable success in image generation tools, such as Midjourney, DALL-E 2 from OpenAI, and Stable Diffusion from Stability AI. These generative models learn the patterns of their training data and generate new outputs with similar characteristics.“We had these neural network architectures that had been useful for protein structure prediction, in particular RoseTTAFold and AlphaFold.6,8 The idea was that maybe these [tools] could be useful for design as well, but it wasn't immediately obvious how to do that in a way that could generate large diversity,” said Brian Trippe, a postdoctoral fellow based in Columbia University and co-advised by Baker. Trippe is another lead author of RFdiffusion.“In a design campaign, you need a bunch of different good options to order and test in the laboratory. 
The community was generally thinking, ‘how do we get real generative AI involved in protein design?’ People were moving toward diffusion models because of how strong they looked in image generation literature,” continued Juergens.The Baker laboratory's structure-forward approach for de novo protein design follows four steps (Fig. 1). First, a backbone, or a protein structure suggested to produce a specific function, is generated. Second, sequences are designed that are predicted to fold into the desired backbone. Third, the sequences from Step 2 are computational filtered for the top candidates that are most likely to succeed in Step 4, experimental validation.FIG. 1. A structure-forward approach for de novo protein design (Credit: Joseph Watson).(1) Generate a structural backbone with a suggested function. (2) Design sequences that are predicted to fold into the backbone from (1). (3) Computationally filter the sequences from (2) for the top candidates predicted to succeed in experimental validation. (4) Experimentally characterize the predicted design.With the rise of machine learning in protein design, many tools have been developed to facilitate Steps 2 and 3 of this workflow, such as ProteinMPNN9 for fixed backbone sequence design and AlphaFold and RoseTTAFold for computationally validating sequence folds, a method known as a self-consistency measure.Methods for diverse and high-throughput backbone generation (Step 1) have remained a bottleneck. 
Diffusion models are a natural fit to address this problem given that they can generate large numbers of diverse outputs, operate directly on amino acid coordinates, and condition on a wide range of inputs with the goal of specifying function.Experimentally Validated“We put an enormous amount of effort into working out the mathematical and statistical programming implementation details of this idea and our simulation results were looking just exciting enough to put out a preprint,” said Trippe.It was one week in December 2022, just before Baker posted the RFdiffusion preprint, when the experimental validation started coming in. “We were hearing [remarks from colleagues] that the binders we wanted to try were actually sticking. It seemed that everything that we had tried was working!” said Trippe.Helen Eisenach, a graduate student in the Baker laboratory and another co-lead author of RFdiffusion, emphasizes that one major advantage of RFdiffusion is the improved rate in which hypotheses are generated and tested in the laboratory.“Not only are you generating really good hypotheses in the computer, but then we're seeing that a lot of them are passing. Your rate of finding a successful design is just astronomically higher than a lot of other previous methods,” said Eisenach. “You have both speed on the compute side but also a lot higher accuracy on the experimental side.”She also notes that RFdiffusion is a step forward from previous backbone generation approaches from the Baker laboratory, such as “hallucination,” which performs well to produce designs but has a slower design process, and “inpainting,” which provides quick generation but lacks diversity in output.10Columbia's AlQuraishi concurs that one of the key advantages of diffusion models for protein design lies in the improved experimental success rates.“Prior to the ‘diffusion evolution’, the success rates were probably on the order of maybe 1 to 10,000, if you're lucky,” said AlQuraishi. 
“With diffusion models, the success rates are closer to the single percentages when you get into the laboratory. They're still not great but it's a huge magnitude improvement of what it used to be and it's been a really big deal.”AlQuraishi also indicates that these improved success rates are related to key conceptual differences that diffusion-based design brings to the table. Before diffusion, the lowest energy search approach to protein design, he says, was “an uninformative way to propose sequences. The vast majority of proposals are not correct and you're hoping that if you sample enough times, you'll happen upon something that works by chance.”“With diffusion models, it's more of a direct thing. You're not generating many hypotheses and then evaluating them in an uninformative way until, by luck, you happen onto something that works. The diffusion models take you to something that already works,” AlQuraishi continued.Learning a New LanguageStructure-forward approaches, such as diffusion models, are not the only machine learning tools infiltrating the protein design field. Large language models that take advantage of protein sequencing data are another rising tool. Earlier this year, researchers from Salesforce Research published ProGen, a protein language model trained on millions of raw protein sequences that generate artificial proteins across multiple families and functions.11Juergens notes that an advantage of sequence-based design is the ability to model function that cannot be described by a static structure.“At the IPD, we have a very structure-forward thought process because many functions can be described by one or a few well-defined structures. There are some molecular functions that have a flexing of a molecule or a shifting of a domain. 
Language models that reason in sequence space have the capability of implicitly modeling those functions that are described by more than just a single structure,” said Juergens.

“A good analogy is that for large language models, [researchers] used to think that you had to model how reasoning worked. You had to do neuro-symbolic reasoning, which is this emerging behavior of how intelligence is formed. [Researchers later] found out that if you just do a very simple task, you can actually model very complex structures in human reasoning,” said Jason Yim, a graduate student at MIT and another co-lead author of RFdiffusion.

“The idea here is that if you train on all the sequence data for proteins, you can learn something about the higher order of how the proteins operate,” Yim continued. Another factor he says is driving the advance of language models in protein design is the wealth of sequence data in comparison to structural data. “In structure, it can take a graduate student an entire PhD to solve a single structure, whereas we can generate sequences at a much faster rate with the current sequencing technology. There's so much data that's underutilized for design.”

Helen Eisenach, David Juergens, Brian Trippe and Jason Yim (left to right) are co-lead authors of RFdiffusion who joined GEN Biotechnology for an interview to discuss the newest advances in protein design.

The Ultimate Convergence

Although the combination of sequence-based and structure-based machine learning approaches has opened new avenues for protein design, additional growth is on the horizon. AlQuraishi indicates that more work is needed for these models to accommodate complex instructions to generate structures for biotechnological applications.

“An exciting frontier is conditional generation, where you don't just care about an end generation [that can be experimentally validated]. You care about specifying certain properties that are desirable for your protein,” said AlQuraishi.
“Designing structures with fairly simple constraints [is currently achievable], but something more sophisticated, such as drug discovery, requires more complex constraints.”

Building on this complexity, Juergens highlights that the field is also excited about higher order molecular functions that are not limited to protein atoms. “Trying to simulate molecular properties and functions that require, say, a protein interacting with a metal, small molecule, or RNA or DNA is a growing area of focus for both structure prediction and design,” he said.

Overall, the potential for AI tools to infiltrate all aspects of biotechnology continues to be boundless.

“I could certainly see a point in time where [natural language models, such as ChatGPT,] are able to summarize papers and draw insights from the literature,” AlQuraishi remarks. “At some point, we may be able to interface molecular language models with natural language models, such that we can tell ChatGPT, ‘design your protein with those properties’, and then it goes and gives you a sequence.” That's some way off, “but that would be the ultimate convergence.”

What new protein design barriers will AI break next? Only time will tell.

References

1. Watson JL, Juergens D, Bennett NR, et al. De novo design of protein structure and function with RFdiffusion. Nature 2023;620(7976):1089–1100; doi: 10.1038/s41586-023-06415-8
2. Torres SV, Leung PJY, Lutz ID, et al. De novo design of high-affinity protein binders to bioactive helical peptides. bioRxiv 2022;2022.12.10.519862; doi: 10.1101/2022.12.10.519862
3. Ingraham J, Baranov M, Costello Z, et al. Illuminating protein space with a programmable generative model. bioRxiv 2022;2022.12.01.518682; doi: 10.1101/2022.12.01.518682
4. Lin Y, AlQuraishi M. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds.
arXiv 2023; doi: 10.48550/arXiv.2301.12485
5. RFdiffusion. Available from: https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb [Last accessed: September 18, 2023].
6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–589; doi: 10.1038/s41586-021-03819-2
7. UW Institute for Protein Design. David Baker's Technology Transfer and Advisory Roles. 2023. Available from: https://www.ipd.uw.edu/baker-technology-transfer-roles/ [Last accessed: September 18, 2023].
8. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373(6557):871–876; doi: 10.1126/science.abj8754
9. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022;378(6615):49–56; doi: 10.1126/science.add2187
10. Wang J, Lisanza S, Juergens D, et al. Scaffolding protein functional sites using deep learning. Science 2022;377(6604):387–394; doi: 10.1126/science.abn2100
11. Madani A, Krause B, Greene ER, et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 2023;41(8):1099–1106; doi: 10.1038/s41587-022-01618-2

Copyright 2023, Mary Ann Liebert, Inc., publishers

To cite this article: Fay Lin. Diffusion Evolution: New Artificial Intelligence Models Break Barriers in Protein Design. GEN Biotechnology. Oct 2023. 333-337. http://doi.org/10.1089/genbio.2023.29114.fli

Published in Volume: 2 Issue 5: October 16, 2023