Reproducible Scientific Workflows for High Performance and Cloud Computing

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Publication date: 2019-05-14
DOI: 10.1109/CCGRID.2019.00028

Felix Bartusch, Maximilian Hanussek, Jens Krüger, O. Kohlbacher

Citations: 5

Abstract

Many complex data analysis tasks are performed by scientific workflows and pipelines deployed on high performance computing (HPC) or cloud computing resources. The complex software stack a workflow requires, together with unnoticed dependencies, can make deploying a pipeline a demanding task. Once deployed, workflows tend to be black boxes, especially for users who did not create the pipeline themselves. At the end of a project, a researcher should archive the pipeline to ensure that published results remain reproducible. This paper illustrates a possible solution for each of these three tasks: reproducible deployment via software containers, automated generation of provenance information to open up the black boxes, and archiving of software containers with the CiTAR service.



