Ensemble automated approaches for producing high-quality herbarium digital records

IF 2.4, Zone 3 (Biology), Q2 (PLANT SCIENCES)

Applications in Plant Sciences. Pub Date: 2024-11-05. DOI: 10.1002/aps3.11623

Robert P. Guralnick, Raphael LaFrance, Julie M. Allen, Michael W. Denslow


Citations: 0

Abstract


Premise

One of the slowest steps in digitizing natural history collections is converting labels associated with specimens into a digital data record usable for collections management and research. Here, we address how herbarium specimen labels can be converted into digital data records via extraction into standardized Darwin Core fields.

Methods

We first showcase the development of a rule-based approach and compare outcomes with a large language model–based approach, in particular ChatGPT4. We next quantified omission and commission error rates across target fields for a set of labels transcribed using optical character recognition (OCR) for both approaches. For example, we find that ChatGPT4 often creates field names that are not Darwin Core compliant while rule-based approaches often have high commission error rates.
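The two error types and the field-name check mentioned above can be illustrated with a minimal Python sketch. The abstract does not define the metrics, so the definitions here are assumptions: an omission is counted when the reference record has a value and the extraction has none, and a commission is counted when the extraction returns a value that does not match the reference. `DWC_TERMS` is a small hand-picked subset of Darwin Core terms, and the function names `error_rates` and `noncompliant_fields` are hypothetical, not taken from the paper.

```python
from collections import Counter

# Illustrative subset of Darwin Core terms; the full standard defines many more.
DWC_TERMS = {
    "scientificName", "recordedBy", "recordNumber", "eventDate",
    "country", "stateProvince", "county", "locality", "habitat",
}


def noncompliant_fields(record):
    """Return the extracted field names that are not in the Darwin Core subset."""
    return sorted(name for name in record if name not in DWC_TERMS)


def error_rates(gold, extracted, fields):
    """Per-field omission and commission rates over paired label records.

    Assumed definitions (not taken from the paper):
      omission   = reference has a value, extraction returned nothing
      commission = extraction returned a value that differs from the reference,
                   including a value where the reference has none
    """
    if not gold or len(gold) != len(extracted):
        raise ValueError("gold and extracted must be non-empty and the same length")
    omitted, committed = Counter(), Counter()
    for gold_rec, ext_rec in zip(gold, extracted):
        for field in fields:
            gold_val, ext_val = gold_rec.get(field), ext_rec.get(field)
            if gold_val and not ext_val:
                omitted[field] += 1
            elif ext_val and ext_val != gold_val:
                committed[field] += 1
    total = len(gold)
    return {
        field: {
            "omission": omitted[field] / total,
            "commission": committed[field] / total,
        }
        for field in fields
    }
```

Exact string matching is the simplest possible comparison; a real evaluation of OCR-derived text would likely need normalization or fuzzy matching before counting a value as wrong.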

Results

Our results suggest that these approaches each have different strengths and limitations. We therefore developed an ensemble approach that leverages the strengths of each individual method and documented that ensembling strongly reduced overall information extraction errors.
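The abstract does not say how the two outputs are combined, so the following Python sketch shows only one plausible way to ensemble them, under stated assumptions: each field has a preferred source (for example, chosen from per-field error rates), the other source is used as a fallback when the preferred one is empty, and any field name outside an allowed list is dropped. The function `ensemble` and its parameters `prefer` and `allowed_fields` are hypothetical names introduced for illustration.

```python
def ensemble(rule_record, llm_record, prefer, allowed_fields):
    """Merge a rule-based and an LLM-based extraction field by field.

    prefer maps a field name to "rule" or "llm"; fields without an entry
    default to "rule". Only fields listed in allowed_fields are kept, which
    discards non-compliant field names from either source.
    """
    sources = {"rule": rule_record, "llm": llm_record}
    merged = {}
    for field in allowed_fields:
        first = prefer.get(field, "rule")
        second = "llm" if first == "rule" else "rule"
        # Take the preferred source's value, falling back to the other source.
        value = sources[first].get(field) or sources[second].get(field)
        if value:
            merged[field] = value
    return merged
```

With this design, a field that one method tends to over-extract can be routed to the other method, while a field that one method omits can still be filled by the fallback.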

Discussion

This work shows that an ensemble approach has particular value for creating high-quality digital data records, even for complicated label content. While human validation is still needed to ensure the best possible quality, automated approaches can speed digitization of herbarium specimen labels and are likely to be broadly usable for all natural history collection types.


Source journal

Applications in Plant Sciences (PLANT SCIENCES)

CiteScore

7.30

Self-citation rate

0.00%

Articles published

Review time

12 weeks

About the journal: Applications in Plant Sciences (APPS) is a monthly, peer-reviewed, open access journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. Given the rapid progress today in technology and its application in the plant sciences, the goal of APPS is to foster communication within the plant science community to advance scientific research. APPS is a publication of the Botanical Society of America, originating in 2009 as the American Journal of Botany's online-only section, AJB Primer Notes & Protocols in the Plant Sciences. APPS publishes the following types of articles: (1) Protocol Notes describe new methods and technological advancements; (2) Genomic Resources Articles characterize the development and demonstrate the usefulness of newly developed genomic resources, including transcriptomes; (3) Software Notes detail new software applications; (4) Application Articles illustrate the application of a new protocol, method, or software application within the context of a larger study; (5) Review Articles evaluate available techniques, methods, or protocols; (6) Primer Notes report novel genetic markers with evidence of wide applicability.