Advancing plant metabolic research by using large language models to expand databases and extract labeled data

IF 2.4 · Zone 3 (Biology) · Q2 PLANT SCIENCES

Applications in Plant Sciences · Pub Date: 2025-05-14 · DOI: 10.1002/aps3.70007

Rachel Knapp, Braidon Johnson, Lucas Busta

Volume 13, issue 4 · Journal Article · Platform: Semanticscholar · https://bsapubs.onlinelibrary.wiley.com/doi/10.1002/aps3.70007

Citations: 0

Abstract

Advancing plant metabolic research by using large language models to expand databases and extract labeled data


Premise

Recently, plant science has seen transformative advances in scalable data collection for sequence and chemical data. These large datasets, combined with machine learning, have demonstrated that conducting plant metabolic research on large scales yields remarkable insights. A key next step in increasing scale has been revealed with the advent of accessible large language models, which, even in their early stages, can distill structured data from the literature. This brings us closer to creating specialized databases that consolidate virtually all published knowledge on a topic.

Methods

Here, we first test different combinations of prompt engineering techniques and language models in the identification of validated enzyme–product pairs. Next, we evaluate the application of automated prompt engineering and retrieval-augmented generation to identify compound–species associations. Finally, we build and determine the accuracy of a multimodal language model–based pipeline that transcribes images of tables into machine-readable formats.
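The abstract does not reproduce the prompts that were tested. As a generic illustration of the first step, the sketch below shows one common prompt engineering pattern: a few-shot prompt that asks for structured output, plus a defensive parser for the reply. The instruction wording, the placeholder enzyme and product names, and the functions `build_prompt` and `parse_pairs` are all assumptions made for illustration, not the authors' pipeline, and no particular model API is called.

```python
import json

# Hypothetical few-shot example; placeholders stand in for real enzyme and product names.
FEW_SHOT = [
    ("Assays showed that enzyme X converts substrate A into product B.",
     [{"enzyme": "X", "product": "B"}]),
]

# Assumed instruction wording, not taken from the paper.
INSTRUCTION = (
    "List every experimentally validated enzyme-product pair in the passage. "
    'Reply with a JSON list of objects with keys "enzyme" and "product". '
    "Reply with [] if there are none."
)

def build_prompt(passage: str) -> str:
    """Assemble instruction, worked examples, and the new passage into one prompt."""
    parts = [INSTRUCTION, ""]
    for example_passage, answer in FEW_SHOT:
        parts += [f"Passage: {example_passage}", f"Answer: {json.dumps(answer)}", ""]
    parts += [f"Passage: {passage}", "Answer:"]
    return "\n".join(parts)

def parse_pairs(reply: str) -> list[tuple[str, str]]:
    """Turn a model reply into (enzyme, product) tuples, skipping malformed items."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON
    if not isinstance(data, list):
        return []
    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        enzyme, product = item.get("enzyme"), item.get("product")
        if isinstance(enzyme, str) and isinstance(product, str):
            pairs.append((enzyme.strip(), product.strip()))
    return pairs

# The prompt string would be sent to whichever language model is being evaluated.
print(build_prompt("Enzyme Y was shown in vitro to produce compound C."))
print(parse_pairs('[{"enzyme": "Y", "product": "C"}]'))
```

Keeping prompt construction separate from reply parsing makes it straightforward to swap prompt variants or models while scoring every combination with the same parser.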

Results

When tuned for each specific task, these methods perform with high (80–90%) or modest (50%) accuracies for enzyme–product pair identification and table image transcription, but with lower false-negative rates than previous methods (decreasing from 55% to 40%) for compound–species pair identification.
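To make the false-negative figure concrete, the following minimal sketch computes a false-negative rate for extracted compound–species pairs against a hand-curated reference set. The function name and the example pairs are hypothetical and chosen only for illustration; this is not the authors' evaluation code or data.

```python
def false_negative_rate(predicted, reference):
    """Fraction of reference pairs that the extraction missed: FN / (FN + TP)."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must not be empty")
    missed = reference - set(predicted)
    return len(missed) / len(reference)

# Hypothetical curated pairs, for illustration only.
reference_pairs = {
    ("caffeine", "Coffea arabica"),
    ("nicotine", "Nicotiana tabacum"),
    ("menthol", "Mentha piperita"),
    ("capsaicin", "Capsicum annuum"),
    ("morphine", "Papaver somniferum"),
}

# Hypothetical extraction output: three correct pairs, two missed, one spurious.
extracted_pairs = {
    ("caffeine", "Coffea arabica"),
    ("nicotine", "Nicotiana tabacum"),
    ("menthol", "Mentha piperita"),
    ("quinine", "Coffea arabica"),  # false positive; does not change the false-negative rate
}

fnr = false_negative_rate(extracted_pairs, reference_pairs)
print(f"false-negative rate: {fnr:.0%}")  # prints "false-negative rate: 40%"
```

Because the false-negative rate counts only missed reference pairs, spurious extractions have to be tracked with a separate measure such as precision.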

Discussion

We enumerate several suggestions for researchers working with language models, among which is the importance of the user's domain-specific expertise and knowledge.


Source journal

Applications in Plant Sciences (PLANT SCIENCES)

CiteScore: 7.30

Self-citation rate: 0.00%

Articles published:

Review time: 12 weeks

About the journal: Applications in Plant Sciences (APPS) is a monthly, peer-reviewed, open access journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. Given the rapid progress today in technology and its application in the plant sciences, the goal of APPS is to foster communication within the plant science community to advance scientific research. APPS is a publication of the Botanical Society of America, originating in 2009 as the American Journal of Botany's online-only section, AJB Primer Notes & Protocols in the Plant Sciences. APPS publishes the following types of articles: (1) Protocol Notes describe new methods and technological advancements; (2) Genomic Resources Articles characterize the development and demonstrate the usefulness of newly developed genomic resources, including transcriptomes; (3) Software Notes detail new software applications; (4) Application Articles illustrate the application of a new protocol, method, or software application within the context of a larger study; (5) Review Articles evaluate available techniques, methods, or protocols; (6) Primer Notes report novel genetic markers with evidence of wide applicability.