Precisely and Persistently Identifying and Citing Arbitrary Subsets of Dynamic Data

Issue 3.4, Fall 2021 | Pub Date: 2021-10-28 | DOI: 10.1162/99608f92.be565013

A. Rauber, Bernhard Gößwein, C. Zwölf, C. Schubert, Florian Wörister, James Duncan, Katharina Flicker, K. Zettsu, Kristof Meixner, L. McIntosh, R. Jenkyns, Stefan Pröll, Tomasz Miksa, M. Parsons


Citations: 6

Abstract

Precisely identifying arbitrary subsets of data so that these can be reproduced is a daunting challenge in data-driven science, the more so if the underlying data source is dynamically evolving. Yet most settings exhibit exactly those characteristics: increasingly large amounts of data being continuously ingested from a range of sources, with error correction and quality improvement processes adding to the dynamics. For studies to be reproducible, for decision-making to be transparent, and for meta studies to be performed conveniently, a precise identification mechanism to reference, retrieve and work with such data is essential. The RDA Working Group on Dynamic Data Citation has published 14 recommendations that are centered around timestamping and versioning evolving data sources and identifying subsets dynamically via persistent identifiers that are assigned to the queries selecting the respective subsets. These principles are generic and work for virtually any kind of data. In the past few years, numerous repositories around the globe have implemented these recommendations and deployed solutions. This paper provides an overview of the recommendations, reference implementations and pilot systems deployed and analyses key lessons learned from these. This provides a solid

