Improving Shuffle I/O performance for big data processing using hybrid storage

X. Ruan, Haiquan Chen
Published in: 2017 International Conference on Computing, Networking and Communications (ICNC)
DOI: 10.1109/ICCNC.2017.7876175 (https://doi.org/10.1109/ICCNC.2017.7876175)
Citations: 5

Abstract

Big data analytics is now widely used in many domains, e.g., weather forecasting, social network analysis, scientific computing, and bioinformatics. As an indispensable part of big data analytics, MapReduce has become the de facto standard model for distributed computing frameworks. As software and hardware components grow more complex, big data analytics systems run into performance bottlenecks when handling ever-larger computing workloads. In our study, we show that the Shuffle mechanism in the current Spark implementation remains a performance bottleneck because of Shuffle I/O latency, and we demonstrate that the Shuffle stage degrades the performance of MapReduce jobs. Observing that high-end solid-state drives (SSDs) handle random writes well thanks to efficient flash translation layer algorithms and larger on-board I/O caches, we present a hybrid storage solution that uses hard disk drives (HDDs) to store large datasets and SSDs to speed up Shuffle I/O, thereby mitigating this degradation. Our extensive experiments with both real-world and synthetic workloads show that the hybrid storage approach improves Shuffle-stage performance compared with the original HDD-based Spark implementation.
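The abstract does not say how the authors route Shuffle I/O to the SSDs. As a rough illustration only, one common way to get a similar split in stock Spark is to point the `spark.local.dir` property (Spark's scratch space, which holds map output files) at SSD mount points while the input datasets stay on HDD-backed storage. The sketch below builds such a setting and renders it as `spark-submit --conf` arguments. The mount paths and the helper names `hybrid_storage_conf` and `to_submit_args` are assumptions made for this example, not part of the paper.

```python
from typing import Dict, List


def hybrid_storage_conf(ssd_dirs: List[str]) -> Dict[str, str]:
    """Return Spark properties that place shuffle scratch files on SSDs.

    ssd_dirs is a hypothetical list of SSD mount points; the large input
    datasets are assumed to stay on HDD-backed storage and are not touched.
    """
    if not ssd_dirs:
        raise ValueError("at least one SSD directory is required")
    # spark.local.dir accepts a comma-separated list of scratch directories.
    return {"spark.local.dir": ",".join(ssd_dirs)}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render properties as arguments for `spark-submit --conf key=value`."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Hypothetical SSD mount points, used only for illustration.
    conf = hybrid_storage_conf(["/mnt/ssd0/spark", "/mnt/ssd1/spark"])
    print(" ".join(to_submit_args(conf)))
```

Note that some cluster managers override `spark.local.dir` with their own local-directory settings, so on such clusters the SSD paths would have to be configured at the cluster-manager level instead.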