The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections

IF 1.3; Zone 3 Computer Science; Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Data Intelligence Pub Date : 2022-03-07 DOI:10.1162/dint_a_00134

A. Hardisty, P. Brack, C. Goble, Laurence Livermore, Ben Scott, Q. Groom, S. Owen, S. Soiland-Reyes

Data Intelligence, vol. 4, pp. 320-341 (2022).

Citations: 7

Abstract

A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable: institutional digitization tends to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are still traditionally transcribed by hand, at high cost and low throughput, which constrains such work for many collection-holding institutions at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation technologies can be implemented as canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud-based workflow platform, the ‘Specimen Data Refinery’ (SDR), founded on the Galaxy workflow engine, Common Workflow Language, Research Object Crates (RO-Crate) and WorkflowHub technologies. The SDR can be applied to specimens’ labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. Two kinds of FAIR Digital Objects (FDO) are created by packaging outputs of SDR workflows and workflow components as digital objects with metadata, a persistent identifier, and a specific type definition. The first kind of FDO is the computable Digital Specimen (DS) object, which can be consumed and produced by workflows and other applications. A single DS is the input data structure submitted to a workflow; each workflow component modifies it in turn, producing a refined DS at the end. The SDR provides a library of such components that can be used individually or in series. To work together, each library component describes the fields it requires from the DS and the fields it will in turn populate or enrich. The second kind of FDO, the RO-Crate, gathers and archives the diverse set of digital and real-world resources, configurations, and actions (the provenance) contributing to a unit of research work, allowing that work to be faithfully recorded and reproduced. Here we describe the Specimen Data Refinery and its motivating requirements, focusing on what is essential in creating canonical workflow component libraries and on its conformance with the requirements of an emerging FDO Core Specification being developed by the FDO Forum.
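The component contract described in the abstract, where each component declares the DS fields it requires and the fields it populates, and a single DS is refined by each component in turn, can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `Component` class, the `refine` function, the field names, and the stub OCR and entity-recognition functions are illustrative assumptions, not the SDR's actual data model or the Galaxy or Common Workflow Language APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical stand-in for a Digital Specimen: a flat mapping of field names to values.
DigitalSpecimen = Dict[str, object]


@dataclass(frozen=True)
class Component:
    """A workflow component that declares the DS fields it reads and writes."""
    name: str
    requires: FrozenSet[str]
    populates: FrozenSet[str]
    run: Callable[[DigitalSpecimen], DigitalSpecimen]


def refine(ds: DigitalSpecimen, components: List[Component]) -> DigitalSpecimen:
    """Pass one DS through each component in turn, checking declared fields."""
    current = dict(ds)  # work on a copy so the input DS is left untouched
    for comp in components:
        missing = comp.requires - set(current)
        if missing:
            raise ValueError(f"{comp.name} is missing fields: {sorted(missing)}")
        updates = comp.run(current)
        undeclared = set(updates) - comp.populates
        if undeclared:
            raise ValueError(f"{comp.name} wrote undeclared fields: {sorted(undeclared)}")
        current.update(updates)
    return current


def fake_ocr(ds: DigitalSpecimen) -> DigitalSpecimen:
    # Placeholder only: a real component would run OCR on the image at ds["image_uri"].
    return {"label_text": "Leg. J. Smith; Kew"}


def fake_ner(ds: DigitalSpecimen) -> DigitalSpecimen:
    # Placeholder only: splits the stub label text instead of running real entity recognition.
    collector, locality = [part.strip() for part in str(ds["label_text"]).split(";")]
    return {"collector": collector.replace("Leg. ", ""), "locality": locality}


pipeline = [
    Component("ocr", frozenset({"image_uri"}), frozenset({"label_text"}), fake_ocr),
    Component("ner", frozenset({"label_text"}), frozenset({"collector", "locality"}), fake_ner),
]

refined = refine({"id": "ds-001", "image_uri": "specimen.jpg"}, pipeline)
```

Because each component states its inputs and outputs up front, a chain that is wired in the wrong order (for example, entity recognition before OCR) fails immediately with a clear error instead of producing a partially refined DS.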
