FuncFetch: an LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts

Nathaniel Smith, Xinyu Yuan, Chesney Melissinos, Gaurav Moghe
Journal: Bioinformatics (Oxford, England)
Publication date: 2024-12-26
Publication type: Journal Article
DOI: 10.1093/bioinformatics/btae756 (https://doi.org/10.1093/bioinformatics/btae756)
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734755/pdf/
Citations: 0

Abstract


Motivation: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and the activities deposited in databases. Deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up the text-mining of protein activities for biocuration.

Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4, and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Given the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26,543 papers, FuncFetch retrieved 32,605 entries from 5,459 selected papers. We also identified multiple extraction errors, including incorrect associations, nontarget enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes.

Availability and implementation: Code and minimally curated activities are available at https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.

Supplementary information: Supplementary data are available at Bioinformatics online.
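
Illustrative note: the precision/recall of 0.86/0.64 reported in the Results can be read as set-overlap metrics between extracted and manually curated records. The sketch below shows that calculation in Python. It is not taken from the FuncFetch code base: the function name `precision_recall`, the exact-match comparison of (enzyme, substrate) pairs, and the example records are all assumptions made for illustration, and the abstract does not state which matching criteria the authors used.

```python
def precision_recall(extracted, curated):
    """Return (precision, recall) of extracted items against a curated reference set.

    Assumption: an extracted item counts as correct only if it matches a
    curated item exactly. The paper's actual matching rules may differ.
    """
    extracted, curated = set(extracted), set(curated)
    true_positives = len(extracted & curated)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical (enzyme, substrate) pairs, invented purely for illustration.
curated = {
    ("EnzA", "substrate1"),
    ("EnzA", "substrate2"),
    ("EnzB", "substrate3"),
    ("EnzC", "substrate4"),
}
extracted = {
    ("EnzA", "substrate1"),   # correct
    ("EnzB", "substrate3"),   # correct
    ("EnzB", "substrate9"),   # wrong association (false positive)
}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

In this toy case two of the three extracted pairs are correct (precision 0.67) and two of the four curated pairs are recovered (recall 0.50). A recall lower than precision, as in the reported 0.86/0.64, means the workflow misses more curated activities than it adds incorrect ones.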
