Improving Shuffle I/O performance for big data processing using hybrid storage
X. Ruan, Haiquan Chen
2017 International Conference on Computing, Networking and Communications (ICNC)
DOI: 10.1109/ICCNC.2017.7876175
Citations: 5
Abstract
Big data analytics is now widely used in many domains, e.g., weather forecasting, social network analysis, scientific computing, and bioinformatics. As an indispensable part of big data analytics, MapReduce has become the de facto standard model for distributed computing frameworks. With the growing complexity of software and hardware components, big data analytics systems face performance bottlenecks when handling ever-larger computing workloads. In our study, we show that the Shuffle mechanism in the current Spark implementation remains a performance bottleneck because of Shuffle I/O latency, and we demonstrate that the Shuffle stage degrades the performance of MapReduce jobs. Observing that high-end solid state disks (SSDs) handle random writes well thanks to efficient flash translation layer algorithms and larger on-board I/O caches, we present a hybrid storage solution that uses hard disk drives (HDDs) to store large datasets and SSDs to speed up Shuffle I/O, mitigating this degradation. Our extensive experiments with both real-world and synthetic workloads show that the hybrid storage approach improves Shuffle-stage performance compared with the original HDD-based Spark implementation.
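The abstract does not say how the hybrid layout is wired into Spark, so the following is only a minimal sketch of one way to approximate it with stock Spark settings, not the authors' implementation. It relies on `spark.local.dir`, Spark's scratch-space property (which covers shuffle map output files), and on `fs.defaultFS` passed through the `spark.hadoop.` prefix. The helper names, the SSD mount points, and the HDFS URI are all hypothetical, and on some cluster managers `spark.local.dir` is overridden by environment variables such as `SPARK_LOCAL_DIRS` or `LOCAL_DIRS`.

```python
from typing import Dict, List


def hybrid_storage_conf(ssd_dirs: List[str], hdfs_uri: str) -> Dict[str, str]:
    """Build Spark settings for an HDD-for-data, SSD-for-shuffle layout.

    Hypothetical helper: ssd_dirs are assumed SSD mount points and
    hdfs_uri is assumed to point at an HDD-backed HDFS cluster.
    """
    if not ssd_dirs:
        raise ValueError("at least one SSD directory is required")
    return {
        # Scratch space used for shuffle map output files; a comma-separated
        # list spreads the files over several devices.
        "spark.local.dir": ",".join(ssd_dirs),
        # Default filesystem for input and output datasets.
        "spark.hadoop.fs.defaultFS": hdfs_uri,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn the settings into `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Assumed paths and host name, for illustration only.
conf = hybrid_storage_conf(
    ["/mnt/ssd0/spark", "/mnt/ssd1/spark"],
    "hdfs://namenode:8020",
)
print(" ".join(to_submit_args(conf)))
```

Keeping the dataset location and the shuffle scratch location as separate settings mirrors the split described in the abstract: bulk sequential reads stay on the cheaper HDDs, while the random-write-heavy shuffle files go to the SSDs.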