A Reproducible Data Analysis Workflow

Aaron Peikert, A. Brandmaier
Journal: Quantitative and Computational Methods in Behavioral Sciences
Published: 2021-05-11
DOI: https://doi.org/10.5964/QCMB.3763
Citations: 9

Abstract

In this tutorial, we describe a workflow to ensure long-term reproducibility of R-based data analyses. The workflow leverages established tools and practices from software engineering. It combines the benefits of various open-source software tools including R Markdown, Git, Make, and Docker, whose interplay ensures seamless integration of version management, dynamic report generation conforming to various journal styles, and full cross-platform and long-term computational reproducibility. The workflow ensures meeting the primary goals that 1) the reporting of statistical results is consistent with the actual statistical results (dynamic report generation), 2) the analysis exactly reproduces at a later point in time even if the computing platform or software is changed (computational reproducibility), and 3) changes at any time (during development and post-publication) are tracked, tagged, and documented while earlier versions of both data and code remain accessible. While the research community increasingly recognizes dynamic document generation and version management as tools to ensure reproducibility, we demonstrate with practical examples that these alone are not sufficient to ensure long-term computational reproducibility. Combining containerization, dependence management, version management, and dynamic document generation, the proposed workflow increases scientific productivity by facilitating later reproducibility and reuse of code and data.
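
As a rough illustration of how the four tools named in the abstract could fit together, the following is a minimal Makefile sketch, not the authors' actual configuration. The file names (`manuscript.Rmd`, `data/raw.csv`), the image name `reproducible-analysis`, and the mount path are hypothetical. It also assumes a `Dockerfile` in the project root whose image contains R, the rmarkdown package, and a LaTeX installation. Recipe lines must be indented with a tab character.

```make
# Hypothetical names; adjust them to the project at hand.
IMAGE   := reproducible-analysis
WORKDIR := /home/rstudio/project

.PHONY: all build

all: manuscript.pdf

# Build the Docker image that fixes the R version and package versions.
build: Dockerfile
	docker build -t $(IMAGE) .

# Render the R Markdown manuscript inside the container, so the report is
# regenerated from code and data instead of from copied results.
# "| build" is an order-only prerequisite: the image is built first, but
# building it does not by itself mark the manuscript as out of date.
manuscript.pdf: manuscript.Rmd data/raw.csv | build
	docker run --rm -v "$(CURDIR)":$(WORKDIR) -w $(WORKDIR) $(IMAGE) \
		Rscript -e 'rmarkdown::render("manuscript.Rmd", output_format = "pdf_document")'
```

Under these assumptions, running `make` rebuilds the manuscript only when the R Markdown source or the data file has changed, and a submitted version could be marked in Git with a tag (for example `git tag submission-1`, a hypothetical tag name) so that it remains retrievable later.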