Improving ReduceTask data locality for sequential MapReduce jobs

Jian Tan, S. Meng, Xiaoqiao Meng, Li Zhang
{"title":"改善顺序MapReduce作业的ReduceTask数据局部性","authors":"Jian Tan, S. Meng, Xiaoqiao Meng, Li Zhang","doi":"10.1109/INFCOM.2013.6566959","DOIUrl":null,"url":null,"abstract":"Improving data locality for MapReduce jobs is critical for the performance of large-scale Hadoop clusters, embodying the principle of moving computation close to data for big data platforms. Scheduling tasks in the vicinity of stored data can significantly diminish network traffic, which is crucial for system stability and efficiency. Though the issue on data locality has been investigated extensively for MapTasks, most of the existing schedulers ignore data locality for ReduceTasks when fetching the intermediate data, causing performance degradation. This problem of reducing the fetching cost for ReduceTasks has been identified recently. However, the proposed solutions are exclusively based on a greedy approach, relying on the intuition to place ReduceTasks to the slots that are closest to the majority of the already generated intermediate data. The consequence is that, in presence of job arrivals and departures, assigning the ReduceTasks of the current job to the nodes with the lowest fetching cost can prevent a subsequent job with even better match of data locality from being launched on the already taken slots. To this end, we formulate a stochastic optimization framework to improve the data locality for ReduceTasks, with the optimal placement policy exhibiting a threshold-based structure. In order to ease the implementation, we further propose a receding horizon control policy based on the optimal solution under restricted conditions. The improved performance is further validated through simulation experiments and real performance tests on our testbed.","PeriodicalId":206346,"journal":{"name":"2013 Proceedings IEEE INFOCOM","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"68","resultStr":"{\"title\":\"Improving ReduceTask data locality for sequential MapReduce jobs\",\"authors\":\"Jian Tan, S. Meng, Xiaoqiao Meng, Li Zhang\",\"doi\":\"10.1109/INFCOM.2013.6566959\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Improving data locality for MapReduce jobs is critical for the performance of large-scale Hadoop clusters, embodying the principle of moving computation close to data for big data platforms. Scheduling tasks in the vicinity of stored data can significantly diminish network traffic, which is crucial for system stability and efficiency. Though the issue on data locality has been investigated extensively for MapTasks, most of the existing schedulers ignore data locality for ReduceTasks when fetching the intermediate data, causing performance degradation. This problem of reducing the fetching cost for ReduceTasks has been identified recently. However, the proposed solutions are exclusively based on a greedy approach, relying on the intuition to place ReduceTasks to the slots that are closest to the majority of the already generated intermediate data. The consequence is that, in presence of job arrivals and departures, assigning the ReduceTasks of the current job to the nodes with the lowest fetching cost can prevent a subsequent job with even better match of data locality from being launched on the already taken slots. 
To this end, we formulate a stochastic optimization framework to improve the data locality for ReduceTasks, with the optimal placement policy exhibiting a threshold-based structure. In order to ease the implementation, we further propose a receding horizon control policy based on the optimal solution under restricted conditions. The improved performance is further validated through simulation experiments and real performance tests on our testbed.\",\"PeriodicalId\":206346,\"journal\":{\"name\":\"2013 Proceedings IEEE INFOCOM\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-04-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"68\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 Proceedings IEEE INFOCOM\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/INFCOM.2013.6566959\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 Proceedings IEEE INFOCOM","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INFCOM.2013.6566959","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 68

Abstract

Improving data locality for MapReduce jobs is critical to the performance of large-scale Hadoop clusters, embodying the principle of moving computation close to data on big data platforms. Scheduling tasks in the vicinity of stored data can significantly reduce network traffic, which is crucial for system stability and efficiency. Although the issue of data locality has been investigated extensively for MapTasks, most existing schedulers ignore data locality for ReduceTasks when fetching intermediate data, causing performance degradation. The problem of reducing the fetching cost for ReduceTasks has been identified only recently, and the proposed solutions are exclusively greedy: they place ReduceTasks in the slots closest to the majority of the already generated intermediate data. The consequence is that, in the presence of job arrivals and departures, assigning the ReduceTasks of the current job to the nodes with the lowest fetching cost can prevent a subsequent job with an even better data-locality match from being launched on the already occupied slots. To this end, we formulate a stochastic optimization framework to improve data locality for ReduceTasks; the optimal placement policy exhibits a threshold-based structure. To ease implementation, we further propose a receding horizon control policy based on the optimal solution under restricted conditions. The improved performance is validated through simulation experiments and real performance tests on our testbed.
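
The contrast the abstract draws between greedy placement and a threshold-structured policy can be illustrated with a minimal sketch. The code below is not from the paper: the fetch-cost model (total remote bytes), the function names, and the fixed `threshold` parameter are illustrative assumptions; the paper derives its threshold from a stochastic optimization rather than fixing it by hand.

```python
# Illustrative cost model (assumption, not the paper's formulation):
# a job's intermediate data footprint is a map from node name to the
# bytes of map output stored on that node.

def fetch_cost(slot_node, partitions):
    """Bytes a ReduceTask launched on slot_node would pull across the
    network, i.e. everything not stored locally."""
    return sum(size for node, size in partitions.items()
               if node != slot_node)

def greedy_place(free_slots, partitions):
    """Greedy policy from prior work: always take the free slot with
    the lowest fetch cost, however poor the match."""
    return min(free_slots, key=lambda s: fetch_cost(s, partitions))

def threshold_place(free_slots, partitions, threshold):
    """Threshold-structured policy: accept the best free slot only if
    its fetch cost is below `threshold`; otherwise return None and
    wait, keeping the slot free for a later, better-matched job."""
    best = min(free_slots, key=lambda s: fetch_cost(s, partitions))
    return best if fetch_cost(best, partitions) <= threshold else None

# Example: 900 of 1000 bytes of intermediate data sit on node "n2",
# which has no free slot.
partitions = {"n1": 100, "n2": 900}
print(greedy_place(["n1", "n3"], partitions))          # "n1" (cost 900)
print(threshold_place(["n1", "n3"], partitions, 500))  # None: hold back
```

The design point is that a threshold policy deliberately leaves a slot idle when the match is poor, trading immediate utilization for the chance that an arriving job fits the slot better.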
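
The receding horizon control idea can be sketched in the same toy model. Everything below is an assumption for illustration: the one-step look-ahead, the forecast of arriving jobs, and the greedy scoring of forecast jobs stand in for the paper's policy, which is built on the optimal solution under restricted conditions.

```python
from typing import Dict, List, Optional

# A job's intermediate data footprint: bytes of map output per node.
Job = Dict[str, int]

def remote_bytes(slot_node: str, job: Job) -> int:
    """Bytes a ReduceTask on slot_node must fetch over the network."""
    return sum(size for node, size in job.items() if node != slot_node)

def receding_horizon_place(free_slots: List[str], current: Job,
                           forecast: List[Job]) -> Optional[str]:
    """One step of a receding-horizon controller: choose the slot for
    the current job that minimizes its own fetch cost plus the best
    fetch costs achievable by the jobs forecast to arrive while the
    slot is held. Only this first decision is committed; the whole
    computation is redone at the next scheduling epoch."""
    best_slot, best_total = None, float("inf")
    for slot in free_slots:
        remaining = [s for s in free_slots if s != slot]
        total = remote_bytes(slot, current)
        for job in forecast:
            # Each forecast job takes its best remaining slot in turn.
            if remaining:
                pick = min(remaining, key=lambda s: remote_bytes(s, job))
                total += remote_bytes(pick, job)
                remaining.remove(pick)
        if total < best_total:
            best_slot, best_total = slot, total
    return best_slot

slots = ["n1", "n2", "n3"]
current = {"n1": 800, "n2": 100}     # current job's data mostly on n1
forecast = [{"n3": 900, "n1": 50}]   # an arriving job matches n3 best
print(receding_horizon_place(slots, current, forecast))  # "n1"
```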