Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins.

IF 2.2 | Zone 3 (Biology) | JCR Q3: GENETICS & HEREDITY

G3: Genes|Genomes|Genetics Pub Date : 2025-07-24 DOI:10.1093/g3journal/jkaf169

Valérie de Crécy-Lagard, Raquel Dias, Nick Sexson, Iddo Friedberg, Yifeng Yuan, Manal A Swairjo



Abstract

Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled the protein "unknome". This large knowledge shortfall is one of the final frontiers of biology. Machine-learning (ML) approaches are enticing, with early successes demonstrating the ability to propagate functional knowledge from experimentally characterized proteins. An open question is whether ML approaches can predict enzymatic functions unseen in the training sets. By integrating literature and a combination of bioinformatic approaches, we individually evaluated Enzyme Commission number predictions made by state-of-the-art ML approaches for over 450 Escherichia coli unknowns. We found that current ML methods not only mostly fail to make novel predictions but also make basic logic errors that human annotators avoid by leveraging the available knowledge base. This underscores the need to include assessments of prediction uncertainty in model output and to test for 'hallucinations' (logic failures) as part of model evaluation. Explainable AI (XAI) analysis can be used to identify indicators of prediction errors, potentially identifying the most relevant data to include in the next generation of computational models.


Source journal

G3: Genes|Genomes|Genetics (GENETICS & HEREDITY)

CiteScore: 5.10

Self-citation rate: 3.80%

Articles published: 305

Review time: 3-8 weeks

About the journal: G3: Genes, Genomes, Genetics provides a forum for the publication of high‐quality foundational research, particularly research that generates useful genetic and genomic information such as genome maps, single gene studies, genome‐wide association and QTL studies, as well as genome reports, mutant screens, and advances in methods and technology. The Editorial Board of G3 believes that rapid dissemination of these data is the necessary foundation for analysis that leads to mechanistic insights. G3, published by the Genetics Society of America, meets the critical and growing need of the genetics community for rapid review and publication of important results in all areas of genetics. G3 offers the opportunity to publish the puzzling finding or to present unpublished results that may not have been submitted for review and publication due to a perceived lack of a potential high-impact finding. G3 has earned the DOAJ Seal, a mark of certification for open access journals, awarded by DOAJ to journals that achieve a high level of openness and adhere to best practice and high publishing standards.