A Reproducible Data Analysis Workflow

Quantitative and Computational Methods in Behavioral Sciences. Pub Date: 2021-05-11. DOI: 10.5964/QCMB.3763

Aaron Peikert, A. Brandmaier


Citations: 9

Abstract

In this tutorial, we describe a workflow to ensure long-term reproducibility of R-based data analyses. The workflow leverages established tools and practices from software engineering. It combines the benefits of various open-source software tools, including R Markdown, Git, Make, and Docker, whose interplay ensures seamless integration of version management, dynamic report generation conforming to various journal styles, and full cross-platform and long-term computational reproducibility. The workflow meets three primary goals: 1) the reported statistical results are consistent with the actual statistical results (dynamic report generation), 2) the analysis reproduces exactly at a later point in time even if the computing platform or software has changed (computational reproducibility), and 3) changes at any time (during development and post-publication) are tracked, tagged, and documented while earlier versions of both data and code remain accessible. While the research community increasingly recognizes dynamic document generation and version management as tools to ensure reproducibility, we demonstrate with practical examples that these alone are not sufficient to ensure long-term computational reproducibility. Combining containerization, dependency management, version management, and dynamic document generation, the proposed workflow increases scientific productivity by facilitating later reproducibility and reuse of code and data.
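
The first goal above (dynamic report generation) can be illustrated with a minimal sketch. It is written in Python rather than R Markdown and is not the authors' workflow; the function name `render_report`, the template text, and the data are hypothetical and chosen only for illustration. The idea it shows is that every number in the report text is filled in from the computed results, so the prose cannot drift away from the analysis.

```python
from statistics import mean, stdev


def render_report(template: str, scores: list[float]) -> str:
    """Fill a report template with statistics computed from the data.

    The text never contains hand-typed numbers: every value is
    substituted from the results dictionary computed here.
    """
    results = {
        "n": len(scores),
        "mean": mean(scores),
        "sd": stdev(scores),  # sample standard deviation (n - 1)
    }
    return template.format(**results)


# Hypothetical report sentence with placeholders for computed values.
TEMPLATE = "We analyzed {n} observations (M = {mean:.2f}, SD = {sd:.2f})."

# Hypothetical data, for illustration only.
scores = [4.0, 5.0, 6.0, 7.0, 8.0]

print(render_report(TEMPLATE, scores))
```

If the data change, rerunning the script regenerates the sentence with the new values, which is the consistency property the abstract attributes to dynamic report generation. The other two goals (computational reproducibility and version tracking) depend on tools such as Docker, Make, and Git and are not covered by this sketch.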
