Optimizing Hadoop Framework for Solid State Drives
Jae-Ki Hong, Liang Li, Chihye Han, Bingxu Jin, Qichao Yang, Zilong Yang
2016 IEEE International Congress on Big Data (BigData Congress), June 2016. DOI: 10.1109/BigDataCongress.2016.11
Solid state drives (SSDs) have been widely used in Hadoop clusters since their introduction to the big data industry. However, the current Hadoop framework is not optimized to take full advantage of SSDs. In this paper, we introduce architectural improvements to the core Hadoop components that fully exploit the performance benefits of SSDs for data- and compute-intensive workloads. The improved architecture features: a simplified data-handling algorithm that uses the SSD's high random IOPS to store and shuffle map output data; an accurate pre-read model for HDFS, based on libaio, that reduces read latency and improves request parallelism; a record-size-based reduce scheduler that overcomes the data skew problem in the reduce phase; and a new HDFS block placement policy that uses disk wear information to manage SSD lifetime. The simplified map output collector and the HDFS pre-read model show 30% and 18% performance improvements on the Terasort and DFSIO benchmarks, respectively, and the modified reduce scheduler runs 12% faster on a real MapReduce application. Extending these results, we show that the modified framework also achieves a 21% performance improvement on Samsung's MicroBrick-based hyperscale system.
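The abstract does not detail how the record-size-based reduce scheduler works. As a rough illustration of the underlying idea only (not the authors' actual algorithm), the following standalone Java sketch balances reduce load by total partition bytes rather than partition count, using a greedy longest-processing-time (LPT) heuristic; the class name and heuristic are our assumptions for this sketch:

```java
import java.util.*;

/** Hypothetical sketch, NOT the paper's scheduler: assign map-output
 *  partitions to reducers by total record bytes instead of key count,
 *  mitigating data skew in the reduce phase. */
public class SizeAwareReduceAssignment {

    /** Greedily assigns each partition (largest first) to the currently
     *  least-loaded reducer; returns the reducer index for each partition. */
    static int[] assign(long[] partitionBytes, int numReducers) {
        // Min-heap of {bytes assigned so far, reducer index}
        PriorityQueue<long[]> load = new PriorityQueue<>(
                Comparator.comparingLong(a -> a[0]));
        for (int r = 0; r < numReducers; r++) load.add(new long[]{0, r});

        // Sort partition indices by size, descending (LPT heuristic).
        Integer[] order = new Integer[partitionBytes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) ->
                Long.compare(partitionBytes[b], partitionBytes[a]));

        int[] assignment = new int[partitionBytes.length];
        for (int p : order) {
            long[] least = load.poll();      // least-loaded reducer
            assignment[p] = (int) least[1];
            least[0] += partitionBytes[p];   // account for the new bytes
            load.add(least);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One skewed partition (900 bytes) among five small ones.
        long[] sizes = {900, 100, 100, 100, 100, 100};
        int[] a = assign(sizes, 2);
        long[] totals = new long[2];
        for (int p = 0; p < sizes.length; p++) totals[a[p]] += sizes[p];
        System.out.println(Arrays.toString(totals));
    }
}
```

With a count-based split, the reducer holding the skewed partition could receive far more data than its peer; the size-aware assignment above keeps the skewed partition alone on one reducer and groups the small ones on the other.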