TS-Hadoop: Handling access skew in MapReduce by using tiered storage infrastructure

Zhanye Wang, Jing Li, Tao Xu, Yu Gu, Dongsheng Wang
Published in: 2014 International Conference on Information and Communication Technology Convergence (ICTC)
Publication date: 2014-12-15
DOI: 10.1109/ICTC.2014.6983331 (https://doi.org/10.1109/ICTC.2014.6983331)
Citations: 5

Abstract

Over the last few years, MapReduce systems have become popular for processing large-scale data sets and are increasingly used in web indexing, data mining, and machine learning. Unlike simple application scenarios such as word count, many MapReduce applications exhibit strongly skewed access patterns in real production environments: data access is non-uniform, and often only a small portion of the data is accessed far more frequently than the rest. Handling this hot data efficiently is therefore critical to the overall performance of a MapReduce computation. In this paper, we present TS-Hadoop, a MapReduce system based on Apache Hadoop. The most significant feature of TS-Hadoop is that it uses a tiered storage infrastructure: besides HDFS, TS-Hadoop also has a shared-disk cluster called HCache, which guarantees that the data it holds can be processed in a highly parallel way. TS-Hadoop automatically distinguishes hot and cold data based on the current workload and moves them into HCache and HDFS respectively, so that the hot data in HCache can be processed efficiently. Experiments show that the average execution time of MapReduce jobs in TS-Hadoop is much shorter than on the traditional Hadoop platform under access-skewed workloads.
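The abstract does not say how TS-Hadoop decides which data is hot, so the following is only a minimal sketch of the general idea of workload-based hot/cold placement, not the authors' algorithm. It assumes that an access log of block identifiers is available and that the most frequently accessed fraction of blocks is treated as hot; the function names `classify_blocks` and `plan_placement` and the `hot_fraction` parameter are hypothetical.

```python
from collections import Counter


def classify_blocks(access_log, hot_fraction=0.2):
    """Split block ids into (hot, cold) sets by access count.

    hot_fraction is a hypothetical tuning knob, not a value from the paper.
    """
    counts = Counter(access_log)
    # Rank blocks from most to least accessed; ties are broken by block id
    # so that the result is deterministic.
    ranked = sorted(counts, key=lambda block: (-counts[block], block))
    # Keep at least one hot block whenever the log is non-empty.
    n_hot = max(1, int(len(ranked) * hot_fraction)) if ranked else 0
    return set(ranked[:n_hot]), set(ranked[n_hot:])


def plan_placement(access_log, hot_fraction=0.2):
    """Map each block id to a storage tier name ("HCache" or "HDFS")."""
    hot, cold = classify_blocks(access_log, hot_fraction)
    placement = {block: "HCache" for block in hot}
    placement.update({block: "HDFS" for block in cold})
    return placement


if __name__ == "__main__":
    # A skewed workload: block "b1" receives most of the accesses.
    log = ["b1"] * 8 + ["b2", "b3", "b4", "b5"]
    for block, tier in sorted(plan_placement(log).items()):
        print(block, tier)
```

A real system would presumably re-evaluate placement as the workload changes and account for the cost of moving data between tiers; this sketch ignores both.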