Exploiting MapReduce and data compression for data-intensive applications

Guangchen Ruan, Hui Zhang, Beth Plale
{"title":"Exploiting MapReduce and data compression for data-intensive applications","authors":"Guangchen Ruan, Hui Zhang, Beth Plale","doi":"10.1145/2484762.2484785","DOIUrl":null,"url":null,"abstract":"HPC platform shows good success for predominantly compute-intensive jobs, however, data intensive jobs still struggle on HPC platform as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the \"moving computation to data\" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and computation is scheduled where the data are located. However, on an HPC machine data must be staged from a broader file system (such as Luster), to HDFS where it can be accessed; this staging can represent a substantial delay in processing. In this paper we look at data compression's effect on reducing bandwidth needs of getting data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications, a 3D-time series caries lesion assessment focusing on large scale medical image dataset, and a HTRC word counting task concerning large scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvement in terms of storage space, data stage-in time, and job execution time.","PeriodicalId":426819,"journal":{"name":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484762.2484785","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

HPC platform shows good success for predominantly compute-intensive jobs, however, data intensive jobs still struggle on HPC platform as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and computation is scheduled where the data are located. However, on an HPC machine data must be staged from a broader file system (such as Luster), to HDFS where it can be accessed; this staging can represent a substantial delay in processing. In this paper we look at data compression's effect on reducing bandwidth needs of getting data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications, a 3D-time series caries lesion assessment focusing on large scale medical image dataset, and a HTRC word counting task concerning large scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvement in terms of storage space, data stage-in time, and job execution time.
为数据密集型应用程序开发MapReduce和数据压缩
HPC平台在以计算密集型为主的作业方面取得了良好的成功,但是,数据密集型作业在HPC平台上仍然很挣扎,因为从I/O节点到计算节点的大量并发数据移动很容易使网络链路饱和。MapReduce是许多令人愉快的并行应用程序的“将计算移动到数据”范例,它假设数据驻留在本地磁盘上,计算在数据所在的位置进行调度。然而,在HPC机器上,数据必须从一个更广泛的文件系统(如Luster)暂放到HDFS,在那里它可以被访问;这种分段可以表示处理过程中的大量延迟。在本文中,我们将研究数据压缩对减少向应用程序获取数据的带宽需求的影响,以及它对数据密集型应用程序的总体性能的影响。我们的研究考察了两种类型的应用,一种是专注于大规模医学图像数据集的3d时间序列龋齿损伤评估,另一种是在XSEDE资源上运行的涉及大规模文本分析的HTRC单词计数任务。我们广泛的实验结果表明,在存储空间、数据阶段导入时间和作业执行时间方面,性能有了显著提高。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信