A performance comparison of Apache Tez and MapReduce with data compression on Hadoop cluster

2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE). Pub Date: 2017-07-01. DOI: 10.1109/JCSSE.2017.8025950

Kritwara Rattanaopas


Citations: 4

Abstract



Big data is a popular topic in cloud computing research. The main characteristics of big data are volume, velocity, and variety, which are difficult to handle with traditional software and methods. Hadoop is an open-source software framework developed to handle big data problems across several domains. For big data analytics, the MapReduce framework is the main engine of a Hadoop cluster and is widely used today. It uses batch-oriented processing. Apache also developed an alternative engine called “Tez”, which supports interactive queries and does not write temporary data to HDFS. In this paper, we focus on the performance comparison between MapReduce and Tez. We also investigate the performance of the two engines when input files and map output files are compressed: Bzip is the compression algorithm used for input files, and Snappy is used for map output files. The word-count and terasort benchmarks are used in our experiments. For the word-count benchmark, the results show that the Tez engine always has a lower execution time than the MapReduce engine, for both compressed and uncompressed data, reducing execution time by up to 39% compared with MapReduce. In contrast, for the terasort benchmark, the Tez engine usually has a higher execution time than the MapReduce engine, by up to 13%. The results also show that compressing map output files with Snappy improves execution time for both benchmarks.
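The abstract reports its results as relative differences in execution time and names Snappy as the codec for map output files. The Python sketch below illustrates both points under stated assumptions: `percent_change` and `snappy_map_output_args` are hypothetical helper names, the timings in the usage section are made-up values chosen only to reproduce the headline percentages (they are not measurements from the paper), and the two `mapreduce.map.output.compress*` properties are the standard Hadoop MapReduce settings for map output compression, which the abstract does not itself confirm were the ones used.

```python
from typing import List


def percent_change(baseline_s: float, candidate_s: float) -> float:
    """Relative change in execution time of candidate vs. baseline, in percent.

    Negative values mean the candidate engine finished faster than the baseline.
    """
    if baseline_s <= 0:
        raise ValueError("baseline execution time must be positive")
    return (candidate_s - baseline_s) / baseline_s * 100.0


def snappy_map_output_args() -> List[str]:
    """Generic -D options that enable Snappy compression of map output.

    Assumption: these are the standard Hadoop MapReduce properties; the
    paper's exact configuration is not given in the abstract.
    """
    return [
        "-Dmapreduce.map.output.compress=true",
        "-Dmapreduce.map.output.compress.codec="
        "org.apache.hadoop.io.compress.SnappyCodec",
    ]


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT taken from the paper.
    wordcount_mr, wordcount_tez = 100.0, 61.0
    terasort_mr, terasort_tez = 100.0, 113.0

    print(f"word-count: {percent_change(wordcount_mr, wordcount_tez):+.1f}%")
    print(f"terasort:   {percent_change(terasort_mr, terasort_tez):+.1f}%")
    print(" ".join(snappy_map_output_args()))
```

Using MapReduce as the baseline matches how the abstract phrases both results: a reduction of up to 39% for word-count and an increase of up to 13% for terasort are then simply a negative and a positive value of the same quantity.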
