Using large language models to extract plant functional traits from unstructured text
Viktor Domazetoski, Holger Kreft, Helena Bestova, Philipp Wieder, Radoslav Koynov, Alireza Zarei, Patrick Weigelt
Applications in Plant Sciences 13(3). Published 2025-06-03. DOI: 10.1002/aps3.70011
https://onlinelibrary.wiley.com/doi/10.1002/aps3.70011
Abstract
Premise
Functional plant ecology seeks to understand how functional traits govern species distributions, community assembly, and ecosystem functions. While global trait datasets have advanced the field, substantial gaps remain, and extracting trait information from text in books, research articles, and online sources via machine learning offers a valuable complement to costly field campaigns.
Methods
We propose a natural language processing pipeline that extracts traits from unstructured species descriptions by using classification models for categorical traits and question-answering models for numerical traits. The pipeline's performance is evaluated on two large databases with over 50,000 species descriptions, utilizing approaches ranging from a keyword search to large language models.
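As a rough illustration of the simplest end of that range, the following is a minimal sketch of a keyword and regular-expression baseline. It is not the authors' implementation: the trait vocabulary, the height pattern, and the function names `classify_growth_form` and `extract_height_m` are all hypothetical, since the abstract does not specify them.

```python
import re
from typing import Optional, Set

# Hypothetical keyword lists; the actual trait vocabulary is not given in the abstract.
GROWTH_FORM_KEYWORDS = {
    "tree": ["tree", "trees"],
    "shrub": ["shrub", "shrubs"],
    "herb": ["herb", "herbs", "herbaceous"],
}


def classify_growth_form(description: str) -> Set[str]:
    """Return every growth-form label whose keywords occur as whole words."""
    text = description.lower()
    found = set()
    for label, keywords in GROWTH_FORM_KEYWORDS.items():
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, text):
            found.add(label)
    return found


# Assumed phrasing: a number, a unit (cm or m), then "tall" or "high".
HEIGHT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|m)\s+(?:tall|high)\b")
UNIT_TO_METRES = {"cm": 0.01, "m": 1.0}


def extract_height_m(description: str) -> Optional[float]:
    """Return the first stated plant height in metres, or None if none is found."""
    match = HEIGHT_PATTERN.search(description.lower())
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNIT_TO_METRES[unit]
```

A baseline of this kind only finds traits that are phrased in the expected way. That limitation is consistent with the lower recall the abstract reports for the regular expression-based approach.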
Results
Our final optimized pipeline used a transformer architecture and obtained a mean precision of 90.8% (range 81.6–97%) and a mean recall of 88.6% (77.4–97%) across five categorical traits, representing a 9.83% increase in precision and 42.35% increase in recall over a regular expression-based approach. The question-answering model yielded a normalized mean absolute error of 10.3% averaged across three numerical traits.
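For readers unfamiliar with these metrics, the sketch below shows how precision, recall, and a normalized mean absolute error can be computed. The abstract does not state how the error was normalized, so dividing by the range of the true values is an assumption made here for illustration. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    positive: str,
) -> Tuple[float, float]:
    """Precision and recall for one class, treating `positive` as the target label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def normalized_mae(
    true_values: Sequence[float],
    predicted_values: Sequence[float],
) -> float:
    """Mean absolute error divided by the range of the true values.

    Assumption: range normalization. The source does not specify the normalizer.
    """
    if not true_values or len(true_values) != len(predicted_values):
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(true_values) - min(true_values)
    if value_range == 0:
        raise ValueError("true values have zero range; cannot normalize")
    mae = sum(abs(t - p) for t, p in zip(true_values, predicted_values)) / len(true_values)
    return mae / value_range
```

Precision is the share of predicted trait values that are correct, and recall is the share of true trait values that the pipeline recovers.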
Discussion
The natural language processing pipeline we propose has the potential to facilitate the digitization and extraction of large amounts of plant functional trait information residing in scattered textual descriptions.
About the journal:
Applications in Plant Sciences (APPS) is a monthly, peer-reviewed, open access journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. Given the rapid progress today in technology and its application in the plant sciences, the goal of APPS is to foster communication within the plant science community to advance scientific research. APPS is a publication of the Botanical Society of America, originating in 2009 as the American Journal of Botany's online-only section, AJB Primer Notes & Protocols in the Plant Sciences.
APPS publishes the following types of articles: (1) Protocol Notes describe new methods and technological advancements; (2) Genomic Resources Articles characterize the development and demonstrate the usefulness of newly developed genomic resources, including transcriptomes; (3) Software Notes detail new software applications; (4) Application Articles illustrate the application of a new protocol, method, or software application within the context of a larger study; (5) Review Articles evaluate available techniques, methods, or protocols; (6) Primer Notes report novel genetic markers with evidence of wide applicability.