AlphaMissense, a groundbreaking advancement in artificial intelligence for predicting the effects of missense variants

MedComm - Future medicine. Pub Date: 2024-01-05. DOI: 10.1002/mef2.70

Ming Yi, Yunqiang Liu, Zhiguang Su



Abstract

In a recent study published in Science,¹ Cheng and colleagues developed AlphaMissense, a highly accurate model built on protein structure prediction that predicts the pathogenicity of all possible missense variants in the human genome at the level of single amino acid substitutions. As a community resource, AlphaMissense helps researchers gain better insight into the functional consequences of genetic variation.

Although more than 4 million missense variants have been identified in the human genome, only approximately 2% are definitively annotated as pathogenic or benign, and the significance of the vast majority remains unknown. This has driven a search for effective methods that accurately predict the clinical implications of these variants.

To date, four main classes of methods have been used to predict the pathogenicity of genetic variants. The first, “database-driven approaches,” rely heavily on carefully curated databases. These methods are prone to data leakage, that is, unintended information transfer between the training and test sets, which undermines their reliability and accuracy.¹,² The second, “weak-labeling approaches,” avoid this circularity by not relying on human annotations. However, their training data often contain false labels, so more reliable labels are needed for accurate evaluation. The third class models the distribution of naturally evolved amino acid sequences and the hidden structure of proteins, providing insight into the evolutionary patterns and functional characteristics of proteins.¹ Such models, however, lack the advanced understanding of protein structure achieved by AlphaFold (AF).³ The fourth class uses protein structure information to improve the assessment of genetic constraint. However, predicting variant pathogenicity from structural features alone has limited accuracy: the modest performance of structure-based approaches on ClinVar variants suggests that additional factors, such as functional annotations, population frequencies, and clinical evidence, play a crucial role in determining the pathogenicity of a genetic variant.¹

AlphaMissense, built on the AF protein structure prediction model, is a machine-learning model that also draws on advances in unsupervised protein language modeling (Figure 1). AF is a groundbreaking method that predicts a protein's three-dimensional structure solely from its amino acid sequence.³ By incorporating the structural insights provided by AF, researchers have made notable progress in accurately assessing the potential pathogenic impact of genetic variants.⁴

AlphaMissense incorporates structural context from AF-derived systems and is fine-tuned on weak labels obtained from population frequency data. Remarkably, the model achieves state-of-the-art performance in clinical annotation, identification of de novo disease variants, and experimental benchmarks, even though it is not trained on datasets specific to these tasks. A notable feature of AlphaMissense is its ability to provide predictions for a wide range of genetic variants, particularly those not yet supported by experimental data, which enables researchers to gain insight into their potential functional consequences.
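To make the idea of weak labels from population frequency data more concrete, the following is a minimal sketch of one possible labeling scheme. It is an illustration under stated assumptions, not the procedure used by Cheng and colleagues: the `Variant` record, the `weak_label` function, the frequency cutoff, and the choice to treat never-observed variants as a noisy pathogenic proxy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one missense variant (not an AlphaMissense type)."""
    protein: str                       # illustrative protein identifier
    position: int                      # 1-based residue position
    ref: str                           # reference amino acid
    alt: str                           # substituted amino acid
    allele_frequency: Optional[float]  # None if never observed in the population


def weak_label(variant: Variant, common_cutoff: float = 1e-5) -> Optional[int]:
    """Return 0 (benign proxy), 1 (pathogenic proxy), or None (left unlabeled).

    Assumption: variants seen frequently in the population are unlikely to be
    strongly deleterious, while variants never observed are enriched for
    deleterious ones. Both labels are noisy, which is why they are "weak".
    The cutoff value is illustrative only.
    """
    if variant.allele_frequency is None:
        return 1   # never observed: noisy pathogenic proxy
    if variant.allele_frequency >= common_cutoff:
        return 0   # commonly observed: noisy benign proxy
    return None    # rare but observed: too ambiguous to label


common = Variant("PROT_A", 10, "A", "V", allele_frequency=0.01)
unseen = Variant("PROT_A", 10, "A", "G", allele_frequency=None)
rare = Variant("PROT_A", 10, "A", "S", allele_frequency=1e-7)
print(weak_label(common), weak_label(unseen), weak_label(rare))
```

Because no human annotation enters the labels, a scheme of this kind avoids the circularity of database-driven training, at the cost of label noise.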

In addition to predicting pathogenicity, AlphaMissense has shown potential in predicting how essential a gene is for cell survival or fitness. The databases produced by AlphaMissense are accessible to the scientific community and include predictions specific to 60,000 alternative transcripts.¹ The availability of accurate predictions for single amino acid substitutions through AlphaMissense allows researchers to prioritize variants for further experimental investigation, reducing the need for extensive and costly experimental characterization of each variant. This accelerates research and enables a more efficient allocation of resources.
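The prioritization step described above can be sketched as a simple ranking by predicted score. The function name `prioritize`, the variant names, and the score values below are made-up placeholders for illustration, not values from the AlphaMissense release.

```python
from typing import Dict, List, Tuple


def prioritize(scores: Dict[str, float], top_k: int) -> List[Tuple[str, float]]:
    """Return the top_k variants by descending score; ties are broken by name."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Placeholder scores in [0, 1]; higher means more likely pathogenic.
example_scores = {"A10V": 0.12, "R45W": 0.97, "G78D": 0.81, "L102P": 0.97}
shortlist = prioritize(example_scores, top_k=2)
print(shortlist)
```

In practice a shortlist like this would only guide which variants to test first; it does not replace experimental characterization.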

Although AlphaMissense is effective at predicting pathogenicity as a scalar score, several limitations warrant consideration. First, AlphaMissense uses wild-type structural predictions and does not directly provide detailed information on how a substitution changes the structure. Additional analyses are needed to investigate structural implications, and integrating multimodal and experimental data remains essential for a comprehensive understanding of structural effects on protein stability and function. Second, AlphaMissense does not explicitly predict the impact of missense variants on biophysical properties such as stability and binding affinity. Third, AlphaMissense is not trained to account for potential interactions with other proteins during pathogenicity prediction. It is also limited to single amino acid substitutions and does not cover more complex variants. Performance may be limited for de novo proteins lacking evolutionary information, because the model relies heavily on multiple sequence alignments for structure prediction and for estimating evolutionary conservation. In addition, calibration on ClinVar shows that the scores deviate from the expected probabilities, so predicted probabilities may not align perfectly with clinically observed pathogenicity. These limitations underscore the need for further modeling advances to handle more complex variants and to improve pathogenicity prediction for de novo proteins.

Ultimately, the concept of mutation pathogenicity is exceedingly complex, with numerous contributing factors. Computational prediction models often oversimplify this complexity, omitting important considerations such as inheritance pattern, allelic state, and incomplete penetrance.⁵ In particular, incomplete penetrance means that not all carriers of a purportedly “pathogenic” allele will manifest clinical disease; genetic background and environmental exposures can modulate the phenotypic effects of mutations. Fully accounting for such nuances remains a notable challenge for in silico approaches. A more comprehensive integration of diverse data, including family history, genetic context, and environmental interactions, may help advance computational pathogenicity prediction by more closely approximating the interplay of biological and environmental determinants that together influence disease risk and expression in the real world.
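The calibration limitation noted above can be checked with a reliability analysis: predictions are grouped into probability bins, and the mean predicted probability in each bin is compared with the fraction of variants that are actually annotated as pathogenic. The sketch below is a generic illustration using only the Python standard library; the function names, the bin count, and the toy inputs are assumptions, and the code does not reproduce the analysis from the original study.

```python
from typing import List, Sequence, Tuple

# (mean predicted probability, observed pathogenic fraction, number of variants)
Bin = Tuple[float, float, int]


def reliability_bins(probs: Sequence[float], labels: Sequence[int],
                     n_bins: int = 5) -> List[Bin]:
    """Group predictions into equal-width bins and summarize each non-empty bin."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[idx] += p
        positives[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], positives[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


def expected_calibration_error(bins: Sequence[Bin]) -> float:
    """Count-weighted mean gap between predicted and observed rates."""
    total = sum(count for _, _, count in bins)
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return sum(count / total * abs(mean_p - frac) for mean_p, frac, count in bins)


# Toy inputs: labels are 1 for pathogenic, 0 for benign.
toy_bins = reliability_bins([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(toy_bins, expected_calibration_error(toy_bins))
```

A well-calibrated score would show predicted and observed rates close to each other in every bin; systematic gaps indicate that the score should not be read directly as a clinical probability.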

In summary, AlphaMissense is a noteworthy advance in protein variant analysis: it predicts the possible impact of every amino acid substitution in the human proteome and classifies 89% of missense variants as either likely benign or likely pathogenic. In addition, AlphaMissense interprets the effect of every possible missense mutation in the human genome on protein structure and function, drawing on insights from protein structure analysis. Overall, by bringing the pathogenicity classification of missense mutations of unknown significance into protein structure prediction models, AlphaMissense holds promise for identifying where disease-causing mutations are likely to occur in a protein.
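The three-way classification summarized above amounts to applying two cutoffs to a continuous score, with an ambiguous band in between. The sketch below illustrates this under stated assumptions: the default cutoffs (0.34 and 0.564) are based on values reported with the public AlphaMissense release and are not stated in this commentary, so they should be verified before use. The function names and example scores are hypothetical.

```python
from typing import Sequence


def classify(score: float, benign_below: float = 0.34,
             pathogenic_above: float = 0.564) -> str:
    """Map a pathogenicity score in [0, 1] to a class (cutoffs are assumptions)."""
    if score < benign_below:
        return "likely_benign"
    if score > pathogenic_above:
        return "likely_pathogenic"
    return "ambiguous"


def fraction_classified(scores: Sequence[float]) -> float:
    """Fraction of variants that fall outside the ambiguous band."""
    if not scores:
        raise ValueError("scores must not be empty")
    decided = sum(1 for s in scores if classify(s) != "ambiguous")
    return decided / len(scores)


# Placeholder scores, not real predictions.
example = [0.10, 0.40, 0.90, 0.20]
print([classify(s) for s in example], fraction_classified(example))
```

Widening the ambiguous band trades coverage for confidence: fewer variants receive a call, but those that do are further from the decision boundary.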

Ming Yi and Yunqiang Liu drafted the manuscript; Zhiguang Su revised the manuscript. All authors have read and approved the final manuscript.

The authors declare no conflict of interest.

Not applicable.

