Using large language models to extract plant functional traits from unstructured text

IF 2.4 · Zone 3, Biology · JCR Q2, Plant Sciences

Applications in Plant Sciences · Pub Date: 2025-06-03 · DOI: 10.1002/aps3.70011

Viktor Domazetoski, Holger Kreft, Helena Bestova, Philipp Wieder, Radoslav Koynov, Alireza Zarei, Patrick Weigelt

Volume 13, Issue 3 · https://bsapubs.onlinelibrary.wiley.com/doi/10.1002/aps3.70011

Abstract

Premise

Functional plant ecology seeks to understand how functional traits govern species distributions, community assembly, and ecosystem functions. While global trait datasets have advanced the field, substantial gaps remain, and extracting trait information from text in books, research articles, and online sources via machine learning offers a valuable complement to costly field campaigns.

Methods

We propose a natural language processing pipeline that extracts traits from unstructured species descriptions by using classification models for categorical traits and question-answering models for numerical traits. The pipeline's performance is evaluated on two large databases with over 50,000 species descriptions, utilizing approaches ranging from a keyword search to large language models.
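As a point of reference for the simplest end of that range, the following is a minimal sketch of a keyword and regular-expression extractor, not the authors' implementation. The trait names (growth form, plant height), the keyword lists, and the function names are hypothetical and chosen only for illustration; the abstract does not list the traits or patterns actually used.

```python
import re

# Hypothetical keyword patterns for one categorical trait (growth form).
# The categories and keywords are illustrative assumptions, not from the paper.
GROWTH_FORM_PATTERNS = {
    "tree": re.compile(r"\btrees?\b", re.IGNORECASE),
    "shrub": re.compile(r"\b(?:shrubs?|subshrubs?)\b", re.IGNORECASE),
    "herb": re.compile(r"\b(?:herbs?|herbaceous)\b", re.IGNORECASE),
}

# Hypothetical pattern for one numerical trait (plant height).
# Matches phrases such as "3 m tall", "2.5-4 m high", or "30 cm tall".
HEIGHT_PATTERN = re.compile(
    r"(?:(\d+(?:\.\d+)?)\s*[-–]\s*)?(\d+(?:\.\d+)?)\s*(cm|m)\s+(?:tall|high)",
    re.IGNORECASE,
)


def extract_growth_forms(description):
    """Return the set of growth-form labels whose keywords occur in the text."""
    return {
        label
        for label, pattern in GROWTH_FORM_PATTERNS.items()
        if pattern.search(description)
    }


def extract_max_height_m(description):
    """Return the (upper) height in metres from the first match, or None."""
    match = HEIGHT_PATTERN.search(description)
    if match is None:
        return None
    value = float(match.group(2))  # upper bound if a range was given
    unit = match.group(3).lower()
    return value / 100.0 if unit == "cm" else value


if __name__ == "__main__":
    text = "Evergreen shrub or small tree, 2.5-4 m tall, with leathery leaves."
    print(extract_growth_forms(text))
    print(extract_max_height_m(text))
```

A baseline like this is cheap and transparent, but it only finds traits that are stated with the anticipated wording, which is consistent with the recall gap the learned models are meant to close.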

Results

Our final optimized pipeline used a transformer architecture and obtained a mean precision of 90.8% (range 81.6–97%) and a mean recall of 88.6% (77.4–97%) across five categorical traits, representing a 9.83% increase in precision and 42.35% increase in recall over a regular expression-based approach. The question-answering model yielded a normalized mean absolute error of 10.3% averaged across three numerical traits.
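The metrics named above can be sketched as follows. This is an illustrative computation under stated assumptions rather than the paper's evaluation code: precision and recall are computed here for a single positive class, and the mean absolute error is normalized by the range of the true values. The abstract does not specify either choice, and the function names are hypothetical.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Precision and recall for one class, treated as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def normalized_mae(true_values, predicted_values):
    """Mean absolute error divided by the range of the true values.

    Assumption: range normalization. Other conventions (for example,
    dividing by the mean of the true values) would give different numbers.
    """
    if not true_values or len(true_values) != len(predicted_values):
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(t - p) for t, p in zip(true_values, predicted_values)) / len(true_values)
    value_range = max(true_values) - min(true_values)
    if value_range == 0:
        raise ValueError("true values have zero range; cannot normalize")
    return mae / value_range


if __name__ == "__main__":
    truth = ["tree", "tree", "shrub", "herb"]
    predicted = ["tree", "shrub", "shrub", "tree"]
    print(precision_recall(truth, predicted, positive="tree"))
    print(normalized_mae([1.0, 3.0, 5.0], [1.5, 3.0, 4.0]))
```

Averaging such per-class or per-trait values gives summary figures of the kind reported in the abstract, although the exact averaging scheme used by the authors is not stated there.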

Discussion

The natural language processing pipeline we propose has the potential to facilitate the digitization and extraction of large amounts of plant functional trait information residing in scattered textual descriptions.

Source journal

Applications in Plant Sciences (Plant Sciences)

CiteScore: 7.30

Self-citation rate: 0.00%

Review time: 12 weeks

About the journal: Applications in Plant Sciences (APPS) is a monthly, peer-reviewed, open access journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. Given the rapid progress today in technology and its application in the plant sciences, the goal of APPS is to foster communication within the plant science community to advance scientific research. APPS is a publication of the Botanical Society of America, originating in 2009 as the American Journal of Botany's online-only section, AJB Primer Notes & Protocols in the Plant Sciences. APPS publishes the following types of articles: (1) Protocol Notes describe new methods and technological advancements; (2) Genomic Resources Articles characterize the development and demonstrate the usefulness of newly developed genomic resources, including transcriptomes; (3) Software Notes detail new software applications; (4) Application Articles illustrate the application of a new protocol, method, or software application within the context of a larger study; (5) Review Articles evaluate available techniques, methods, or protocols; (6) Primer Notes report novel genetic markers with evidence of wide applicability.