Born-FAIR Data Projects Using Cookiecutter Templates
Felix Henninger
DOI: 10.52825/cordi.v1i.331
Proceedings of the Conference on Research Data Infrastructure, 2023-09-07
Abstract
Implementing research data management best practices and FAIR principles (Wilkinson et al., 2016) is vital for transparent, reproducible research, as well as for efficient, sustainable science that avoids duplication of effort. Scientists can also benefit directly from incorporating data management into their data collection and analysis workflows. However, there is an initial cost to adoption that poses a burden and a substantial barrier to entry, even for well-intentioned researchers. In our experience in statistical and RDM-focused consulting, this cost increases as a project progresses, with a late-stage conversion being the most costly in terms of resources and energy, because a working analysis needs to be adapted in a single step. Therefore, we believe that it is useful to adopt best practices for data stewardship early on in a project, and ideally from the get-go.
In this contribution, we present a tool for creating and instantiating project templates that conform to good practices with regard to data management and to analysis projects more generally. In the same vein as "born-open data" (Rouder, 2016), where data is published immediately upon collection, our goal is to establish born-FAIR datasets that implement proven methods for data stewardship as early in the research data lifecycle as is feasible. Our aim is to encourage researchers and analysts to incorporate best practices into their workflows from the outset by providing data and analysis templates that implement desirable properties. By adopting these templates, researchers immediately gain access to a number of tools that simplify their work and make it more efficient, while also providing a foundation for increased reproducibility, data documentation through codebooks, and metadata for long-term archival. Because a broad-strokes approach may not work in practice due to the idiosyncrasies of individual research projects, one size may not fit all. For this reason, the templates contain customisation options that researchers can use to tailor them to their requirements.
The templates build on the well-established cookiecutter library for the Python programming language (Greenfeld et al., 2022), which we additionally extend to R, a programming language somewhat more common among statisticians and social scientists, thereby creating a cross-platform infrastructure. Both libraries create a project skeleton with a pre-specified directory structure, and include configuration for commonly used tools. Upon template creation, a wizard guides users through a customisation step, allowing them to adapt the templates to their needs and to the demands of the project at hand.
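The core mechanism described above can be illustrated with a minimal, self-contained sketch: a template maps relative paths to file contents with placeholders, and instantiation renders that skeleton into a target directory with the user's chosen values. The directory layout and variable names below are hypothetical illustrations, not the project's actual templates, and the sketch uses only the Python standard library rather than cookiecutter itself.

```python
# Minimal sketch of what a cookiecutter-style instantiation does: render a
# pre-specified directory skeleton, substituting user-supplied values.
# The template layout and variables below are hypothetical examples.
import tempfile
from pathlib import Path
from string import Template

# A toy "template": relative paths mapped to file contents with $placeholders.
TEMPLATE = {
    "README.md": "# $project_name\n\nMaintainer: $author\n",
    "data/raw/.gitkeep": "",
    "analysis/main.R": "# Analysis entry point for $project_name\n",
}

def instantiate(template: dict, context: dict, target: Path) -> list:
    """Render every file in the template into `target`, creating directories."""
    created = []
    for rel_path, content in template.items():
        dest = target / Template(rel_path).substitute(context)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(Template(content).substitute(context))
        created.append(dest)
    return created

root = Path(tempfile.mkdtemp())
context = {"project_name": "born-fair-demo", "author": "F. Henninger"}
files = instantiate(TEMPLATE, context, root)
print((root / "README.md").read_text().splitlines()[0])  # "# born-fair-demo"
```

The real cookiecutter library additionally renders placeholders in directory names, supports Jinja2 templating, and runs the interactive wizard; this sketch only captures the skeleton-rendering idea.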
Owing to the open-source nature of the project and the firmly established and well-documented standard, researchers can easily adapt templates and create their own, to accommodate their specific needs and domain requirements. We hope to foster a community of researchers who share and improve their workflows, and anticipate further uses of the templates for teaching and other purposes.
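Adapting or creating a template in the cookiecutter standard centres on a `cookiecutter.json` file at the template's root: each key becomes a wizard prompt, a plain value serves as the default answer, and a list offers a fixed choice (with the first entry as default). The specific variables below are illustrative assumptions, not the project's actual options.

```python
# Hypothetical cookiecutter.json defining a template's customisation options.
# Keys become wizard prompts; lists become fixed-choice questions.
import json

cookiecutter_json = {
    "project_name": "my-analysis",
    "author": "Jane Doe",
    "language": ["R", "Python"],        # first entry is the default choice
    "include_codebook": ["yes", "no"],  # toggle optional codebook scaffolding
}

# A template repository would ship this as cookiecutter.json at its root.
serialized = json.dumps(cookiecutter_json, indent=2)
print(serialized)
```

Because the format is plain JSON with well-documented conventions, a research group can fork an existing template, edit this one file plus the skeleton directory, and immediately have a domain-specific template.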
At CoRDI, we hope to introduce our project to the wider NFDI community and propose it as a lightweight, interoperable and interdisciplinary standard, benefiting all researchers across domains. By streamlining advanced users' workflows, and making reproducible practices more accessible, we aim to enable and facilitate the uptake of RDM across the communities represented there, and build integrations to interoperate with the multitude of services currently under development and in use.
To summarise, we introduce templates for data analysis and archival that researchers can apply themselves, to enable and encourage better practices during analysis, and to prepare data for long-term storage and later re-use. Our hope is to encourage more researchers to adopt RDM best practices more frequently and earlier in their projects, demonstrating the value of a more structured workflow and facilitating a shift to FAIR principles more generally.