"Publish First": A Rapid, GPT-4 Based Digitisation System for Small Institutes with Minimal Resources

Biodiversity Information Science and Standards. Publication date: 2023-09-11. DOI: 10.3897/biss.7.112428

Rukaya Johaadien, Michal Torma

Abstract

We present a streamlined technical solution ("Publish First") designed to assist smaller, resource-constrained herbaria in rapidly publishing their specimens to the Global Biodiversity Information Facility (GBIF). Specimen data from smaller herbaria, particularly those in biodiversity-rich regions of the world, provide a valuable and often unique contribution to the global pool of biodiversity knowledge (Marsico et al. 2020). However, these institutions often face challenges not applicable to larger herbaria, including a lack of staff with technical skills, limited staff hours for digitization work, inadequate financial resources for specialized scanning equipment, cameras, lights, and imaging stands, limited (or no) access to computers and collection management software, and unreliable internet connections. Data-scarce and biodiversity-rich countries are also often linguistically diverse (Gorenflo et al. 2012), and staff may not have English skills, which means pre-existing online data publication resources and guides are of limited use. The "Publish First" method we are trialing addresses several of these issues: it drastically simplifies the publication process so technical skills are not necessary; it minimizes administrative tasks, saving time; it uses simple, cheap and easily available hardware; it does not require any specialized software; and the process is so simple that there is little to no need for any written instructions. "Publish First" requires staff to attach QR code labels containing identifiers to herbarium specimen sheets, scan these sheets using a document scanner costing around €300, then drag and drop these files to an S3 bucket (a cloud container that specializes in storing files). Subsequently, these images are automatically processed through an Optical Character Recognition (OCR) service to extract text, which is then passed on to OpenAI's Generative Pre-trained Transformer 4 (GPT-4) Application Programming Interface (API) for standardization.
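The scan, OCR and standardization steps described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the term list `DWC_TERMS`, the prompt wording, the mapping of the QR identifier to `catalogNumber`, and the three callables `read_qr`, `run_ocr` and `ask_llm` are all hypothetical stand-ins for the QR decoder, the OCR service and the GPT-4 API call, none of which are specified in the abstract.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical subset of Darwin Core terms; the abstract does not list the terms used.
DWC_TERMS = ["scientificName", "recordedBy", "eventDate", "country", "locality"]


def build_prompt(ocr_text: str) -> str:
    """Build a standardization prompt from raw OCR text (illustrative wording only)."""
    terms = ", ".join(DWC_TERMS)
    return (
        "Extract the following Darwin Core terms from the herbarium label text "
        f"and reply with one JSON object using exactly these keys: {terms}. "
        "Use null for anything that is not present.\n\n"
        f"Label text:\n{ocr_text}"
    )


def process_scan(
    image: bytes,
    read_qr: Callable[[bytes], str],   # assumed: decodes the QR label to an identifier
    run_ocr: Callable[[bytes], str],   # assumed: wraps whichever OCR service is used
    ask_llm: Callable[[str], str],     # assumed: wraps the LLM API, returns JSON text
) -> Dict[str, Optional[str]]:
    """Turn one scanned sheet into a flat record keyed by Darwin Core terms."""
    specimen_id = read_qr(image)
    label_text = run_ocr(image)
    fields = json.loads(ask_llm(build_prompt(label_text)))
    # Assumption: the QR identifier is stored as catalogNumber.
    record: Dict[str, Optional[str]] = {"catalogNumber": specimen_id}
    for term in DWC_TERMS:
        # Keep only the requested terms; anything the model omitted becomes None.
        record[term] = fields.get(term)
    return record


if __name__ == "__main__":
    # Stub services so the sketch runs without network access.
    demo = process_scan(
        b"fake-image-bytes",
        read_qr=lambda img: "HERB-0001",
        run_ocr=lambda img: "Betula pubescens, leg. A. Collector, 1987-06-12, Norway",
        ask_llm=lambda prompt: json.dumps(
            {"scientificName": "Betula pubescens", "country": "Norway"}
        ),
    )
    print(demo)
```

Passing the three services in as callables keeps the sketch independent of any particular OCR provider or LLM client, so each step can be swapped or tested with stubs.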
The standardized data is integrated into a Darwin Core Archive file that is automatically published through GBIF's Integrated Publishing Toolkit (IPT) (GBIF 2021). The most technically challenging aspect of this project has been the standardization of OCR data to Darwin Core using the GPT-4 API, particularly in crafting precise prompts to address the inherent inconsistency and lack of reliability of Large Language Models (LLMs). Despite this, GPT-4 outperformed our manual scraping efforts. Our choice of GPT-4 as a model was a naive one: we implemented the workflow on some pre-digitized specimens from previously published Norwegian collections, compared the published data on GBIF with GPT-4's Darwin Core standardized output, and found the results satisfactory. Moving forward, we plan to undertake more rigorous research to compare the effectiveness and cost-efficiency of different LLMs as Darwin Core standardization engines. We are also particularly interested in exploring the new "function calling" feature added to the GPT-4 API, as it promises to allow us to retrieve standardized data in a more consistent and structured format. This workflow is currently under trial in Tajikistan, and may be used in Uzbekistan, Armenia and Italy in the near future.
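To illustrate why function calling may yield more consistent output, the sketch below defines a function description whose parameters are a JSON Schema over a few Darwin Core terms, then validates the argument string a model would return. It is an assumption-laden example rather than the authors' code: the function name `record_specimen`, the choice of terms in `DWC_FIELDS` and the validation rules are hypothetical, and no API call is made.

```python
import json
from typing import Dict, Optional

# Hypothetical subset of Darwin Core terms with short descriptions for the model.
DWC_FIELDS = {
    "scientificName": "Full scientific name as written on the label",
    "recordedBy": "Collector name(s)",
    "eventDate": "Collection date, ISO 8601 (YYYY-MM-DD) where possible",
    "country": "Country name",
    "locality": "Locality description",
}


def dwc_function_schema() -> dict:
    """Function definition in the name / description / parameters (JSON Schema) shape."""
    return {
        "name": "record_specimen",  # hypothetical function name
        "description": "Record Darwin Core fields extracted from a herbarium label.",
        "parameters": {
            "type": "object",
            "properties": {
                term: {"type": ["string", "null"], "description": desc}
                for term, desc in DWC_FIELDS.items()
            },
            "required": list(DWC_FIELDS),
            "additionalProperties": False,
        },
    }


def parse_arguments(arguments_json: str) -> Dict[str, Optional[str]]:
    """Check the JSON argument string returned for the function call.

    Raises ValueError if keys are missing or unexpected, or a value is not a
    string or null, so malformed model output is caught before publication.
    """
    data = json.loads(arguments_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [term for term in DWC_FIELDS if term not in data]
    extra = [key for key in data if key not in DWC_FIELDS]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    for term, value in data.items():
        if value is not None and not isinstance(value, str):
            raise ValueError(f"{term} must be a string or null")
    return data
```

Validating locally in this way does not depend on the model honoring the schema, which matters given the inconsistency the abstract reports.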
