Unbinds data and tasks to improving the Hadoop performance

15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) Pub Date : 2014-09-01 DOI:10.1109/SNPD.2014.6888710

Kun Lu, Dong Dai, Xuehai Zhou, Mingming Sun, Changlong Li, Hang Zhuang


Citations: 4

Abstract

Hadoop is a popular framework that provides a simple programming interface for parallel programs that process large-scale data on clusters of commodity machines. Data-intensive programs make up an important part of the cluster workload, especially large-scale machine learning algorithms that execute the same program iteratively. Caching input data in memory is an efficient way to speed up these data-intensive programs. However, limited memory capacity means that not all data can be loaded into memory, so the key challenge is knowing accurately when data should be cached in memory and when it should be released. A second problem is that memory capacity may not be enough to hold even the input data of the running program, so some data cannot be cached in memory. Prefetching is an effective method in this situation. We propose an unbinding technique that does not bind programs and data together before the real computation starts. With unbinding, Hadoop achieves better performance when using caching and prefetching. We present unbinding-Hadoop, a Hadoop framework with unbinding that decides each map task's input data in the map start-up phase rather than at the job submission phase. Prefetching can also be used in unbinding-Hadoop, where it performs better than it does for programs without unbinding. Evaluations of this system show that unbinding-Hadoop reduces job execution time by 40.2% for WordCount programs and 29.2% for the K-means algorithm, respectively.
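
The unbinding idea in the abstract, choosing a map task's input when the task starts rather than when the job is submitted, can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions: the class `LateBindingScheduler`, the method `next_split`, and the parameter `cached_on_node` are hypothetical names, not part of the Hadoop API and not the authors' implementation. It assumes each starting map task can report which input splits are already cached in memory on its node.

```python
from collections import deque


class LateBindingScheduler:
    """Hypothetical scheduler: binds a split to a map task at task start-up."""

    def __init__(self, splits):
        # Splits that no map task has claimed yet, in submission order.
        self.pending = deque(splits)

    def next_split(self, cached_on_node):
        """Choose the input split for a map task that is starting now.

        Prefer a pending split already held in the node's memory cache;
        otherwise fall back to the oldest pending split.
        Returns None when no splits remain.
        """
        for split in self.pending:
            if split in cached_on_node:
                # Returning immediately after the removal keeps the
                # iteration over the deque safe.
                self.pending.remove(split)
                return split
        return self.pending.popleft() if self.pending else None


scheduler = LateBindingScheduler(["s0", "s1", "s2", "s3"])
first = scheduler.next_split(cached_on_node={"s2"})  # cached split is preferred
second = scheduler.next_split(cached_on_node=set())  # falls back to oldest split
```

Because no split is tied to a task in advance, a cache or prefetcher is free to load any pending split, and whichever task starts next can consume it. With binding at job submission, a task would have to wait for its own pre-assigned split regardless of what is already in memory.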


