A Performance Comparison of Dask and Apache Spark for Data-Intensive Neuroimaging Pipelines

2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS). Pub Date: 2019-07-30. DOI: 10.1109/WORKS49585.2019.00010

Mathieu Dugré, Valérie Hayot-Sasson, T. Glatard

Citations: 12

Abstract

In the past few years, neuroimaging has entered the Big Data era due to the joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engine has emerged in this field, and several alternatives remain available. We compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, when processing neuroimaging pipelines. Our evaluation uses two synthetic pipelines processing the 81 GB BigBrain image and a real pipeline processing anatomical data from more than 1,000 subjects. We benchmark these pipelines using various combinations of task durations, data sizes, and numbers of workers, deployed on an 8-node compute cluster (8 cores per node) in Compute Canada's Arbutus cloud. We evaluate PySpark's RDD API against Dask's Bag, Delayed, and Futures APIs. Results show that, despite slight differences between Spark and Dask, both engines perform comparably. However, Dask pipelines risk being limited by Python's global interpreter lock (GIL), depending on task type and cluster configuration. In all cases, the major limiting factor was data transfer. While either engine is suitable for neuroimaging pipelines, more effort is needed to reduce data transfer time.
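
The task-parallel pattern behind a Futures-style pipeline can be sketched with Python's standard `concurrent.futures` module, which Dask's Futures interface extends. This is an illustration only, not the paper's benchmark code: the `increment_block` task, the block sizes, and the worker count are hypothetical stand-ins. Because the sketch uses a thread pool, pure-Python tasks like this one would contend for the GIL in standard CPython, which is the kind of limitation the abstract mentions for some Dask configurations.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def increment_block(block):
    """Hypothetical per-block task: add 1 to every value in a block."""
    return [value + 1 for value in block]


def run_pipeline(blocks, n_workers=4):
    """Submit one task per block and gather results in the original block order."""
    results = [None] * len(blocks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Map each future back to the index of the block it processes.
        future_to_index = {
            pool.submit(increment_block, block): i
            for i, block in enumerate(blocks)
        }
        # Futures complete in arbitrary order; the index restores block order.
        for future in as_completed(future_to_index):
            results[future_to_index[future]] = future.result()
    return results


# Stand-in data: 8 small blocks of 5 values instead of chunks of a large image.
blocks = [list(range(start, start + 5)) for start in range(0, 40, 5)]
processed = run_pipeline(blocks)
```

Swapping the thread pool for a process pool, or for a distributed engine, keeps the same submit-and-gather structure but adds the cost of moving each block to and from the worker, the data transfer cost that the abstract identifies as the main limiting factor.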
