A virtual machine based task scheduling approach to improving data locality for virtualized Hadoop

Ruiqi Sun, Jie Yang, Zhan Gao, Zhiqiang He
Published in: 2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS)
Publication date: 2014-06-04
DOI: 10.1109/ICIS.2014.6912150 (https://doi.org/10.1109/ICIS.2014.6912150)
Citations: 10

Abstract

MapReduce has emerged as an important distributed programming paradigm for large-scale data analysis. As an open-source implementation of MapReduce, Hadoop is an attractive platform for many enterprises. A traditional Hadoop cluster deployed on a large number of physical machines has drawbacks, such as burdensome cluster management and fluctuating resource utilization. A virtualized Hadoop cluster not only simplifies cluster management but also enables cost-effective workload consolidation to improve resource utilization. In Hadoop, data locality is a critical factor affecting the performance of MapReduce applications. However, existing task scheduling approaches for improving data locality in virtualized Hadoop are not effective, because data is distributed at two levels: virtual machines and physical servers. In this paper, we deploy a virtualized Hadoop cluster in which computing nodes and storage nodes are placed in separate virtual machines to improve flexibility. We propose a novel task scheduling approach that improves data locality for a virtualized Hadoop cluster by migrating the virtual machine acting as a computing node to the physical server that runs the virtual machine acting as a storage node and holding a data replica needed by that computing node. We evaluated the efficiency of our approach on a virtualized Hadoop cluster with this deployment, consisting of 11 computing nodes and 12 storage nodes. Our experimental results show that our approach improves the performance of 86% of the typical MapReduce applications in our benchmark suite, to varying degrees.
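
The migration idea in the abstract can be illustrated with a small sketch. This is not the authors' scheduler: the abstract gives no data structures, cost model, or migration policy, so the `Cluster` class, the per-server VM capacity limit, and the "first replica server with a free slot" rule below are all assumptions made for illustration. The sketch only shows the core decision: if a computing VM is not on a physical server that hosts a storage VM holding a replica of the needed block, pick such a server as the migration target.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cluster:
    """Hypothetical cluster model; none of these fields come from the paper."""
    vm_host: Dict[str, str]               # VM name -> physical server running it
    block_replicas: Dict[str, List[str]]  # data block -> storage VMs holding a replica
    host_capacity: Dict[str, int]         # physical server -> max VMs (assumed limit)

    def vms_on(self, host: str) -> int:
        """Count the VMs currently placed on the given physical server."""
        return sum(1 for h in self.vm_host.values() if h == host)


def pick_migration_target(cluster: Cluster, compute_vm: str, block: str) -> Optional[str]:
    """Return the server to migrate compute_vm to, or None if no move is needed or possible."""
    current = cluster.vm_host[compute_vm]
    # Physical servers that run a storage VM holding a replica of the block.
    replica_hosts = [cluster.vm_host[s] for s in cluster.block_replicas.get(block, [])]
    if current in replica_hosts:
        return None  # the computing VM already shares a server with a replica
    for host in replica_hosts:
        # Assumed policy: take the first replica server with a free VM slot.
        if cluster.vms_on(host) < cluster.host_capacity[host]:
            return host
    return None  # every replica server is full; the task would read remotely


def schedule(cluster: Cluster, compute_vm: str, block: str) -> str:
    """Apply the placement decision and return the server the computing VM ends up on."""
    target = pick_migration_target(cluster, compute_vm, block)
    if target is not None:
        # Stand-in for a live VM migration; a real system would call the hypervisor here.
        cluster.vm_host[compute_vm] = target
    return cluster.vm_host[compute_vm]


if __name__ == "__main__":
    cluster = Cluster(
        vm_host={"compute-1": "server-A", "storage-1": "server-A", "storage-2": "server-B"},
        block_replicas={"block-7": ["storage-2"]},
        host_capacity={"server-A": 4, "server-B": 2},
    )
    print(schedule(cluster, "compute-1", "block-7"))
```

In this toy setup the only replica of `block-7` sits on `storage-2`, which runs on `server-B`, so `compute-1` is moved there. A realistic policy would also have to weigh migration cost against the expected saving in remote reads, which the abstract does not describe.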