A virtual machine based task scheduling approach to improving data locality for virtualized Hadoop

2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS) Pub Date : 2014-06-04 DOI:10.1109/ICIS.2014.6912150

Ruiqi Sun, Jie Yang, Zhan Gao, Zhiqiang He


Citations: 10

Abstract

MapReduce has emerged as an important distributed programming paradigm for large-scale data analysis applications. As an open-source implementation of MapReduce, Hadoop is an attractive platform for many enterprises. A traditional Hadoop cluster deployed on a large number of physical machines has drawbacks, such as burdensome cluster management and fluctuating resource utilization. A virtualized Hadoop cluster not only simplifies cluster management but also facilitates cost-effective workload consolidation for better resource utilization. In a Hadoop system, data locality is a critical factor affecting the performance of MapReduce applications. However, existing task scheduling approaches to improving data locality in virtualized Hadoop are not effective because data is distributed at two levels: virtual machines and physical servers. In this paper, we deploy a virtualized Hadoop cluster in which computing nodes and storage nodes are placed in separate virtual machines to improve flexibility. We propose a novel task scheduling approach that aims to improve data locality for a virtualized Hadoop cluster by migrating the virtual machine acting as a computing node to the physical server that runs the virtual machine acting as a storage node and holding a data replica needed by that computing node. We evaluated the efficiency of our approach on a virtualized Hadoop cluster with the aforementioned deployment, comprising 11 computing nodes and 12 storage nodes. Our experimental results show that our approach improves the performance of 86% of the typical MapReduce applications in our benchmark suite to varying degrees.
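
The migration decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' scheduler: the `Cluster` class, its three maps, the `migration_target` function, and the "first replica wins" tie-break are all hypothetical names and choices made for illustration only. The sketch assumes that each computing or storage virtual machine runs on exactly one physical server and that the replica locations of a data block are known.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    # Hypothetical placement maps: VM name -> physical server name.
    compute_vm_host: dict[str, str]
    storage_vm_host: dict[str, str]
    # Hypothetical replica map: block id -> storage VMs holding a replica.
    block_replicas: dict[str, list[str]]


def migration_target(cluster: Cluster, compute_vm: str, block_id: str) -> str | None:
    """Return the physical server the computing VM should migrate to,
    or None when no migration is needed or possible.

    Assumption: the first listed replica is preferred; a real scheduler
    would also weigh server load and migration cost.
    """
    current_host = cluster.compute_vm_host[compute_vm]
    replica_hosts = [
        cluster.storage_vm_host[vm]
        for vm in cluster.block_replicas.get(block_id, [])
    ]
    # Already co-located with a replica, or no replica known: stay put.
    if not replica_hosts or current_host in replica_hosts:
        return None
    return replica_hosts[0]


if __name__ == "__main__":
    cluster = Cluster(
        compute_vm_host={"compute-1": "server-A", "compute-2": "server-B"},
        storage_vm_host={"storage-1": "server-B", "storage-2": "server-C"},
        block_replicas={"block-42": ["storage-1", "storage-2"]},
    )
    # compute-1 is not on a server holding a replica of block-42.
    print(migration_target(cluster, "compute-1", "block-42"))
    # compute-2 already shares server-B with storage-1.
    print(migration_target(cluster, "compute-2", "block-42"))
```

The sketch captures only the two-level lookup (virtual machine to physical server, block to storage virtual machine) that makes locality harder in a virtualized cluster; the cost of the migration itself is deliberately left out.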

