TS-Hadoop: Handling access skew in MapReduce by using tiered storage infrastructure

2014 International Conference on Information and Communication Technology Convergence (ICTC). Pub Date: 2014-12-15. DOI: 10.1109/ICTC.2014.6983331

Zhanye Wang, Jing Li, Tao Xu, Yu Gu, Dongsheng Wang


Citations: 5

Abstract

Over the last few years, MapReduce systems have become popular for processing large-scale data sets and are increasingly used in web indexing, data mining, and machine learning. Unlike simple application scenarios such as word count, many MapReduce applications exhibit strongly skewed access patterns in real production environments: data access is non-uniform, and often a small portion of the data is accessed far more frequently than the rest. Handling this hot data efficiently is critical to the overall performance of MapReduce computation. In this paper, we present TS-Hadoop, a MapReduce system based on Apache Hadoop. Its most significant feature is its use of a tiered storage infrastructure: besides HDFS, TS-Hadoop also has a shared-disk cluster called HCache, which guarantees that the data it holds can be processed in a highly parallel way. TS-Hadoop automatically distinguishes hot and cold data based on the current workload and moves them into HCache and HDFS respectively, so that the hot data in HCache can be processed efficiently. Experiments show that the average execution time of MapReduce jobs in TS-Hadoop is much shorter than on a traditional Hadoop platform under access-skewed workloads.
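The abstract does not say how TS-Hadoop decides which data is hot, so the following is only an illustrative sketch of one simple way such a split could work: count accesses per block in the current workload and send the most frequently accessed fraction to HCache, the rest to HDFS. The function names, the `hot_fraction` parameter, and the frequency-ranking rule are all assumptions made for this example, not the authors' method.

```python
from collections import Counter


def classify_blocks(access_log, hot_fraction=0.2):
    """Split block IDs into (hot, cold) sets by access frequency.

    access_log   -- iterable of block IDs, one entry per access (hypothetical input)
    hot_fraction -- assumed share of distinct blocks to treat as hot
    """
    counts = Counter(access_log)
    # Rank blocks from most to least accessed; break ties by block ID
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda block: (-counts[block], block))
    # Keep at least one hot block whenever the log is non-empty.
    n_hot = max(1, int(len(ranked) * hot_fraction)) if ranked else 0
    return set(ranked[:n_hot]), set(ranked[n_hot:])


def plan_placement(access_log, hot_fraction=0.2):
    """Map each block ID to the tier it would be moved to."""
    hot, cold = classify_blocks(access_log, hot_fraction)
    placement = {block: "HCache" for block in hot}
    placement.update({block: "HDFS" for block in cold})
    return placement


if __name__ == "__main__":
    # A skewed workload: block "a" receives most of the accesses.
    log = ["a"] * 10 + ["b"] * 2 + ["c", "d", "e"]
    print(plan_placement(log))
```

A fixed fraction is the simplest possible policy; a real system would more likely bound the hot set by the capacity of the fast tier and re-evaluate it as the workload changes.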

