神经影像学大数据处理策略的性能评价

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) Pub Date : 2018-12-16 DOI:10.1109/CCGRID.2019.00059

Valérie Hayot-Sasson, Shawn T. Brown, T. Glatard

{"title":"神经影像学大数据处理策略的性能评价","authors":"Valérie Hayot-Sasson, Shawn T. Brown, T. Glatard","doi":"10.1109/CCGRID.2019.00059","DOIUrl":null,"url":null,"abstract":"Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open-science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, represented by the BigBrain dataset. We contrast these various strategies using Apache Spark and Nipype as our representative Big Data and neuroimaging processing engines, on Dell EMC's Top-500 cluster. Big Data thresholds were modeled by comparing the data-write rate of the application to the filesystem bandwidth and number of concurrent processes. This model acknowledges the fact that page caching provided by the Linux kernel is critical to the performance of Big Data applications. Results show that in-memory computing alone speeds-up executions by a factor of up to 1.6, whereas when combined with data locality, this factor reaches 5.3. Lazy evaluation strategies were found to increase the likelihood of cache hits, further improving processing time. Such important speed-up values are likely to be observed on typical image processing operations performed on images of size larger than 75GB. A ballpark speculation from our model showed that in-memory computing alone will not speed-up current functional MRI analyses unless coupled with data locality and processing around 280 subjects concurrently. Furthermore, we observe that emulating in-memory computing using in-memory file systems (tmpfs) does not reach the performance of an in-memory engine, presumably due to swapping to disk and the lack of data cleanup. We conclude that Big Data processing strategies are worth developing for neuroimaging applications.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Performance Evaluation of Big Data Processing Strategies for Neuroimaging\",\"authors\":\"Valérie Hayot-Sasson, Shawn T. Brown, T. Glatard\",\"doi\":\"10.1109/CCGRID.2019.00059\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open-science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, represented by the BigBrain dataset. We contrast these various strategies using Apache Spark and Nipype as our representative Big Data and neuroimaging processing engines, on Dell EMC's Top-500 cluster. Big Data thresholds were modeled by comparing the data-write rate of the application to the filesystem bandwidth and number of concurrent processes. This model acknowledges the fact that page caching provided by the Linux kernel is critical to the performance of Big Data applications. Results show that in-memory computing alone speeds-up executions by a factor of up to 1.6, whereas when combined with data locality, this factor reaches 5.3. Lazy evaluation strategies were found to increase the likelihood of cache hits, further improving processing time. Such important speed-up values are likely to be observed on typical image processing operations performed on images of size larger than 75GB. A ballpark speculation from our model showed that in-memory computing alone will not speed-up current functional MRI analyses unless coupled with data locality and processing around 280 subjects concurrently. Furthermore, we observe that emulating in-memory computing using in-memory file systems (tmpfs) does not reach the performance of an in-memory engine, presumably due to swapping to disk and the lack of data cleanup. We conclude that Big Data processing strategies are worth developing for neuroimaging applications.\",\"PeriodicalId\":234571,\"journal\":{\"name\":\"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)\",\"volume\":\"75 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGRID.2019.00059\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00059","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

由于图像采集方法、开放科学和数据共享的进步，神经成像数据集的规模正在迅速增长。然而，神经成像处理引擎对大数据处理策略的采用仍然有限。在这里，我们在典型的神经成像用例上评估了三种大数据处理策略(内存计算、数据局部化和延迟评估)，这些用例以BigBrain数据集为代表。我们在Dell EMC的Top-500集群上使用Apache Spark和Nipype作为我们的大数据和神经成像处理引擎的代表来对比这些不同的策略。大数据阈值通过将应用程序的数据写入速率与文件系统带宽和并发进程数进行比较来建模。该模型承认Linux内核提供的页面缓存对大数据应用程序的性能至关重要。结果表明，仅内存计算就可以将执行速度提高1.6倍，而当与数据位置相结合时，这个系数达到5.3倍。发现延迟评估策略增加了缓存命中的可能性，进一步改善了处理时间。在对大于75GB的图像执行典型的图像处理操作时，可能会观察到如此重要的加速值。从我们的模型中得出的一个大致推测表明，内存计算本身不会加速当前的功能性MRI分析，除非与数据局域性和同时处理大约280个受试者相结合。此外，我们观察到，使用内存文件系统(tmpfs)模拟内存计算无法达到内存引擎的性能，可能是由于交换磁盘和缺乏数据清理。我们得出结论，大数据处理策略值得开发用于神经成像应用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Performance Evaluation of Big Data Processing Strategies for Neuroimaging

Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open-science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, represented by the BigBrain dataset. We contrast these various strategies using Apache Spark and Nipype as our representative Big Data and neuroimaging processing engines, on Dell EMC's Top-500 cluster. Big Data thresholds were modeled by comparing the data-write rate of the application to the filesystem bandwidth and number of concurrent processes. This model acknowledges the fact that page caching provided by the Linux kernel is critical to the performance of Big Data applications. Results show that in-memory computing alone speeds-up executions by a factor of up to 1.6, whereas when combined with data locality, this factor reaches 5.3. Lazy evaluation strategies were found to increase the likelihood of cache hits, further improving processing time. Such important speed-up values are likely to be observed on typical image processing operations performed on images of size larger than 75GB. A ballpark speculation from our model showed that in-memory computing alone will not speed-up current functional MRI analyses unless coupled with data locality and processing around 280 subjects concurrently. Furthermore, we observe that emulating in-memory computing using in-memory file systems (tmpfs) does not reach the performance of an in-memory engine, presumably due to swapping to disk and the lack of data cleanup. We conclude that Big Data processing strategies are worth developing for neuroimaging applications.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

自引率

0.00%

发文量