Unearthing the Past for a Sustainable Future: Extracting and transforming data in the Biodiversity Heritage Library for climate action

Biodiversity Information Science and Standards. Pub Date: 2023-09-11. DOI: 10.3897/biss.7.112436

JJ Dearborn, Mike Lichtenberg, Joel Richard, Joseph deVeer, Michael Trizna, Katie Mika


Abstract


As the urgency to address the climate crisis intensifies, accurate and comprehensive biodiversity data have become crucial for informing climate change studies, tracking key environmental indicators, and building global biodiversity monitoring platforms. The Biodiversity Heritage Library (BHL) plays a vital role in the core biodiversity infrastructure, housing over 60 million pages of digitized literature about life on Earth. Recognizing the value of over 500 years of data in BHL, a global network of BHL staff is working to establish a scalable data pipeline to provide actionable occurrence data from BHL's vast and diverse collections. However, transforming textual content into FAIR (findable, accessible, interoperable, reusable) data is challenging because descriptive metadata are missing and the unstructured outputs of commercial text engines are error-ridden (Fig. 1).

Although the wealth of knowledge in BHL is now available to global audiences, the biodiversity and climate data contained in BHL's textual corpus remain underutilized. This hinders scientific research, hampers informed decision-making for conservation efforts, and limits our understanding of biodiversity patterns crucial for addressing the climate crisis. By leveraging recent advancements in text recognition engines, along with cutting-edge AI (Artificial Intelligence) models like OpenAI's CLIP (Contrastive Language-Image Pre-Training) and nascent features in transcription platforms, BHL staff are beginning to process vast amounts of textual and image data and to transform centuries' worth of data from BHL collections into computationally usable formats. Recent technological breakthroughs now offer a transformative opportunity to empower the global biodiversity community with prescient insights from our shared past and to facilitate the integration of historical knowledge into climate action initiatives.
To bridge gaps in the historical record and unlock the potential of BHL, a multi-pronged effort utilizing innovative cross-disciplinary approaches is being piloted. These technical approaches were selected for their efficiency and their ability to generate rapid results that can be applied across the diverse range of materials in BHL (Fig. 2). Piloting a data pipeline that is scalable to 60 million pages requires considerable investigation, experimentation, and resources, but it will have an appreciable impact on global conservation efforts by informing and establishing historic baselines deeper into time. This presentation will focus on the identification, extraction, and transformation of OCR text into structured data outputs in BHL. Approaches include:

- Upgrading legacy OCR text using the Tesseract OCR engine, improving data quality by 20% and openly publishing 40 GB of textual data as FAIR data;
- Evaluating handwritten text recognition (HTR) engines (Microsoft Azure Computer Vision, Google Cloud Vision API (Application Programming Interface), and Amazon Textract) to improve scientific name-finding in BHL's handwritten archival materials using algorithms developed by Global Names Architecture;
- Extracting data from collecting events by loading HTR coordinate outputs into pandas DataFrames (Python) to create structured data;
- Classifying BHL page-level images with OpenAI's CLIP, a neural network model, to accurately identify the handwritten sub-corpus of primary source materials in BHL;
- Running an A/B test to evaluate the efficiency and accuracy of human-keyed transcription data extraction, in order to provide high-quality, human-vetted datasets that can be deposited with data aggregators.
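The step of turning HTR coordinate outputs into structured data can be illustrated with a minimal sketch. It is not the BHL pipeline: the input format, the sample words, the `y_tolerance` and `column_split_x` parameters, and the output field names are all assumptions made for illustration. The idea is simply that words whose bounding boxes share roughly the same vertical position belong to the same row of a field notebook, and that horizontal position separates columns.

```python
from statistics import median

# Hypothetical HTR output: each recognized word with the top-left corner
# of its bounding box in pixels. Real engines return richer structures.
words = [
    {"text": "Quercus", "x": 40, "y": 102},
    {"text": "alba", "x": 160, "y": 100},
    {"text": "12", "x": 420, "y": 101},
    {"text": "May", "x": 460, "y": 103},
    {"text": "1908", "x": 520, "y": 99},
    {"text": "Acer", "x": 42, "y": 151},
    {"text": "rubrum", "x": 120, "y": 149},
    {"text": "3", "x": 421, "y": 150},
    {"text": "June", "x": 450, "y": 152},
    {"text": "1908", "x": 521, "y": 150},
]


def group_rows(words, y_tolerance=10):
    """Cluster words into rows by vertical position.

    A word joins the current row when its y coordinate lies within
    y_tolerance pixels of that row's median y; otherwise it starts a new row.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        if rows and abs(word["y"] - median(w["y"] for w in rows[-1])) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Restore left-to-right reading order within each row.
    return [sorted(row, key=lambda w: w["x"]) for row in rows]


def to_records(rows, column_split_x=400):
    """Split each row into two assumed columns: name (left) and date (right)."""
    records = []
    for row in rows:
        name = " ".join(w["text"] for w in row if w["x"] < column_split_x)
        date = " ".join(w["text"] for w in row if w["x"] >= column_split_x)
        records.append({"verbatim_name": name, "verbatim_date": date})
    return records


records = to_records(group_rows(words))
for record in records:
    print(record)
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame(records)` for further cleaning. Fixed pixel thresholds are a simplification; skewed scans or irregular handwriting would likely need a more robust clustering step.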
The ongoing development of a scalable data pipeline for BHL's biodiversity and climate-related datasets requires sustained support and partnership with the biodiversity community. Initial results demonstrate that liberating data from archival and handwritten field notes is arduous but feasible. Extending these methodologies to the broader scientific literature presents new research opportunities. Extracting and normalizing data from unstructured textual sources can significantly advance biodiversity research and inform environmental policy. The Biodiversity Heritage Library staff are committed to building multiple scalable data pipelines, with the ultimate goal of creating a global biodiversity knowledge graph, rich in interconnected data and semantic meaning, that enables informed decisions for the preservation and sustainable management of Earth's biodiversity.
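The abstract does not state how the 20% improvement in OCR data quality is measured. One common way to compare a legacy OCR output with a re-processed one is the character error rate (CER) against a human-verified reference, sketched below. The metric choice and the sample strings are assumptions for illustration, not results from BHL.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Made-up strings standing in for a verified transcription and two OCR runs.
reference = "Quercus alba"
legacy_ocr = "Qnercus a1ba"
new_ocr = "Quercus a1ba"

legacy_cer = character_error_rate(legacy_ocr, reference)
new_cer = character_error_rate(new_ocr, reference)
print(f"legacy CER: {legacy_cer:.3f}, new CER: {new_cer:.3f}")
```

In practice such a comparison would be averaged over a sample of pages with verified transcriptions, since a single line says little about a 60-million-page corpus.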

