人类在循环中:社区科学与机器学习协同克服标本馆数字化瓶颈

IF 2.7 3区 生物学 Q2 PLANT SCIENCES
Robert Guralnick, Raphael LaFrance, Michael Denslow, Samantha Blickhan, Mark Bouslog, Sean Miller, Jenn Yost, Jason Best, Deborah L. Paul, Elizabeth Ellwood, Edward Gilbert, Julie Allen
{"title":"人类在循环中:社区科学与机器学习协同克服标本馆数字化瓶颈","authors":"Robert Guralnick,&nbsp;Raphael LaFrance,&nbsp;Michael Denslow,&nbsp;Samantha Blickhan,&nbsp;Mark Bouslog,&nbsp;Sean Miller,&nbsp;Jenn Yost,&nbsp;Jason Best,&nbsp;Deborah L. Paul,&nbsp;Elizabeth Ellwood,&nbsp;Edward Gilbert,&nbsp;Julie Allen","doi":"10.1002/aps3.11560","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Premise</h3>\n \n <p>Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long-recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>We present two new semi-automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans-in-the-loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>Our results showcase a &gt;93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre-processing, multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields &gt;4-fold reductions in errors compared to off-the-shelf open-source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.</p>\n </section>\n \n <section>\n \n <h3> Discussion</h3>\n \n <p>Our work showcases a usable set of tools for herbarium digitization including a custom-built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use.</p>\n </section>\n </div>","PeriodicalId":8022,"journal":{"name":"Applications in Plant Sciences","volume":"12 1","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/aps3.11560","citationCount":"0","resultStr":"{\"title\":\"Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks\",\"authors\":\"Robert Guralnick,&nbsp;Raphael LaFrance,&nbsp;Michael Denslow,&nbsp;Samantha Blickhan,&nbsp;Mark Bouslog,&nbsp;Sean Miller,&nbsp;Jenn Yost,&nbsp;Jason Best,&nbsp;Deborah L. Paul,&nbsp;Elizabeth Ellwood,&nbsp;Edward Gilbert,&nbsp;Julie Allen\",\"doi\":\"10.1002/aps3.11560\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n \\n <section>\\n \\n <h3> Premise</h3>\\n \\n <p>Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long-recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Methods</h3>\\n \\n <p>We present two new semi-automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans-in-the-loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Results</h3>\\n \\n <p>Our results showcase a &gt;93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre-processing, multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields &gt;4-fold reductions in errors compared to off-the-shelf open-source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Discussion</h3>\\n \\n <p>Our work showcases a usable set of tools for herbarium digitization including a custom-built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use.</p>\\n </section>\\n </div>\",\"PeriodicalId\":8022,\"journal\":{\"name\":\"Applications in Plant Sciences\",\"volume\":\"12 1\",\"pages\":\"\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2024-01-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/aps3.11560\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applications in Plant Sciences\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/aps3.11560\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PLANT SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applications in Plant Sciences","FirstCategoryId":"99","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/aps3.11560","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PLANT SCIENCES","Score":null,"Total":0}
引用次数: 0

摘要

在自然历史藏品数字化过程中,最慢的步骤之一就是将成像标签转换为数字文本。我们在此介绍一种有效的解决方案,利用社区科学工作和机器学习方法之间的协同作用,克服这一长期公认的效率瓶颈。
本文章由计算机程序翻译,如有差异,请以英文原文为准。

Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks

Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks

Premise

Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long-recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches.

Methods

We present two new semi-automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans-in-the-loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline.

Results

Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre-processing, multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4-fold reductions in errors compared to off-the-shelf open-source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.

Discussion

Our work showcases a usable set of tools for herbarium digitization including a custom-built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
7.30
自引率
0.00%
发文量
50
审稿时长
12 weeks
期刊介绍: Applications in Plant Sciences (APPS) is a monthly, peer-reviewed, open access journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. Given the rapid progress today in technology and its application in the plant sciences, the goal of APPS is to foster communication within the plant science community to advance scientific research. APPS is a publication of the Botanical Society of America, originating in 2009 as the American Journal of Botany''s online-only section, AJB Primer Notes & Protocols in the Plant Sciences. APPS publishes the following types of articles: (1) Protocol Notes describe new methods and technological advancements; (2) Genomic Resources Articles characterize the development and demonstrate the usefulness of newly developed genomic resources, including transcriptomes; (3) Software Notes detail new software applications; (4) Application Articles illustrate the application of a new protocol, method, or software application within the context of a larger study; (5) Review Articles evaluate available techniques, methods, or protocols; (6) Primer Notes report novel genetic markers with evidence of wide applicability.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信