The Future of Natural History Transcription: Navigating AI advancements with VoucherVision and the Specimen Label Transcription Project (SLTP)
William Weaver, Kyle Lough, Stephen Smith, Brad Ruhfel
Biodiversity Information Science and Standards, published 2023-09-21
DOI: https://doi.org/10.3897/biss.7.113067
Abstract
Natural history collections are critical reservoirs of biodiversity information but collections staff are constantly grappling with substantial backlogs and limited resources. The task of transcribing specimen label text into searchable databases requires a significant amount of time, manual labor, and funding. To address this challenge, we introduce VoucherVision, a tool harnessing the capabilities of several Large Language Models (LLMs; Naveed et al. 2023) to augment specimen label transcription. The VoucherVision tool automates laborious components of the transcription process, leveraging an Optical Character Recognition (OCR) system and LLMs to convert unstructured label text into appropriate data formats compatible with database ingestion. VoucherVision uses a combination of structured output parsers and recursive re-prompting strategies to ensure consistency and quality of the LLM-formatted text, significantly reducing errors.
Integrating VoucherVision into the University of Michigan Herbarium’s transcription workflow significantly reduced per-image transcription time, suggesting considerable potential advantages for collections workflows. VoucherVision is a promising step towards efficient digitization, with curatorial staff continuing to play critical roles in data quality assurance and process oversight. To emphasize the importance of knowledge sharing, the University of Michigan Herbarium is backing the Specimen Label Transcription Project (SLTP), which will provide open access to benchmarking datasets, fine-tuned models, and validation tools for ranking the performance of different methodologies, LLMs, and prompting strategies. In the rapidly evolving landscape of Artificial Intelligence (AI) development, we recognize the potential of diverse contributions and innovative methodologies to advance curatorial practices and accelerate digitization in natural history collections.
An early, public version of VoucherVision is available to try here: https://vouchervision.azurewebsites.net/
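The combination of a structured output parser and recursive re-prompting mentioned in the abstract can be illustrated with a minimal sketch. This is not VoucherVision's code: the function names (`parse_structured`, `transcribe`, `scripted_llm`), the prompt wording, the retry limit, and the field list (borrowed from Darwin Core term names) are all assumptions made for illustration, and the LLM is represented by any callable that maps a prompt string to a reply string.

```python
import json
from typing import Callable, Dict, List

# Assumed output schema for illustration only; VoucherVision's real schema may differ.
REQUIRED_FIELDS = ["scientificName", "recordedBy", "eventDate", "country"]


def parse_structured(reply: str) -> Dict[str, str]:
    """Parse an LLM reply as JSON and check it against the expected fields."""
    data = json.loads(reply)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {name: str(data[name]) for name in REQUIRED_FIELDS}


def transcribe(
    ocr_text: str,
    llm: Callable[[str], str],
    feedback: str = "",
    retries_left: int = 2,
) -> Dict[str, str]:
    """Ask the LLM to structure OCR text; on a parse failure, re-prompt recursively."""
    prompt = (
        "Return a JSON object with the keys "
        + ", ".join(REQUIRED_FIELDS)
        + " for this specimen label text:\n"
        + ocr_text
    )
    if feedback:
        # Feed the parser's complaint back so the model can correct itself.
        prompt += "\nYour previous reply was rejected: " + feedback
    reply = llm(prompt)
    try:
        return parse_structured(reply)
    except ValueError as err:
        if retries_left == 0:
            raise  # give up and leave the record for a human transcriber
        return transcribe(ocr_text, llm, str(err), retries_left - 1)


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model: replays canned replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)
```

In this sketch a record that still fails after the final retry raises an error rather than being silently accepted, which matches the abstract's point that curatorial staff remain responsible for quality assurance. To try it without a model, pass `scripted_llm([...])` with one malformed reply followed by a valid JSON reply and observe that `transcribe` returns the parsed fields on the second attempt.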