Optimizing Hadoop Framework for Solid State Drives
Jae-Ki Hong, Liang Li, Chihye Han, Bingxu Jin, Qichao Yang, Zilong Yang
2016 IEEE International Congress on Big Data (BigData Congress), June 2016. DOI: 10.1109/BigDataCongress.2016.11
Solid state drives (SSDs) have been widely used in Hadoop clusters since their introduction to the big data industry. However, the current Hadoop framework is not optimized to take full advantage of SSDs. In this paper, we introduce architectural improvements to the core Hadoop components that fully exploit the performance benefits of SSDs for data- and compute-intensive workloads. The improved architecture features: a simplified data-handling algorithm that uses the SSD's high random IOPS to store and shuffle map output data; an accurate pre-read model for HDFS, based on libaio, that reduces read latency and improves request parallelism; a record-size-based reduce scheduler that overcomes the data skew problem in the reduce phase; and a new HDFS block placement policy that uses disk wear information to manage SSD lifetime. The simplified map output collector and the HDFS pre-read model show 30% and 18% performance improvements on the Terasort and DFSIO benchmarks, respectively, and the modified reduce scheduler runs 12% faster on a real MapReduce application. Extending these results, we show that the modified framework also achieves a 21% performance improvement on Samsung's MicroBrick-based hyperscale system.
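The abstract does not detail how the record-size-based reduce scheduler works. As a rough illustration of the underlying idea only (not the authors' actual algorithm), the following standalone Java sketch balances reduce load by total partition bytes rather than partition count, using a greedy longest-processing-time (LPT) heuristic; the class name and heuristic are our assumptions for this sketch:

```java
import java.util.*;

/** Hypothetical sketch, NOT the paper's scheduler: assign map-output
 *  partitions to reducers by total record bytes instead of key count,
 *  mitigating data skew in the reduce phase. */
public class SizeAwareReduceAssignment {

    /** Greedily assigns each partition (largest first) to the currently
     *  least-loaded reducer; returns the reducer index for each partition. */
    static int[] assign(long[] partitionBytes, int numReducers) {
        // Min-heap of {bytes assigned so far, reducer index}
        PriorityQueue<long[]> load = new PriorityQueue<>(
                Comparator.comparingLong(a -> a[0]));
        for (int r = 0; r < numReducers; r++) load.add(new long[]{0, r});

        // Sort partition indices by size, descending (LPT heuristic).
        Integer[] order = new Integer[partitionBytes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) ->
                Long.compare(partitionBytes[b], partitionBytes[a]));

        int[] assignment = new int[partitionBytes.length];
        for (int p : order) {
            long[] least = load.poll();      // least-loaded reducer
            assignment[p] = (int) least[1];
            least[0] += partitionBytes[p];   // account for the new bytes
            load.add(least);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One skewed partition (900 bytes) among five small ones.
        long[] sizes = {900, 100, 100, 100, 100, 100};
        int[] a = assign(sizes, 2);
        long[] totals = new long[2];
        for (int p = 0; p < sizes.length; p++) totals[a[p]] += sizes[p];
        System.out.println(Arrays.toString(totals));
    }
}
```

With a count-based split, the reducer holding the skewed partition could receive far more data than its peer; the size-aware assignment above keeps the skewed partition alone on one reducer and groups the small ones on the other.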