A comparative study of MapReduce and Hive based on the design of the Information Gain algorithm for analytical workloads

S. Bagui, Sharon K. John, John P. Baggs
Proceedings of the ACMSE 2018 Conference. Published 2018-03-29. DOI: 10.1145/3190645.3190705 (https://doi.org/10.1145/3190645.3190705)

Abstract

Information Gain (IG), or the Kullback-Leibler algorithm, is a statistical algorithm used to extract useful features from datasets and eliminate redundant and valueless ones. Applying this feature selection technique paves the way for sophisticated analysis of Big Data, which requires the underlying framework to handle the data's complexity, volume, and velocity. The Hadoop ecosystem is well suited to this, enabling seamless distributed computing that leverages the computing potential of many commodity machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's use to analytical algorithms, analyze how well it suits them, and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study showcases the efficacy of designing IG for the Hadoop framework and discusses the implementation of IG for analytical workloads on Hive and MapReduce. Both components are inherently built on a shared-nothing architecture, which prevents contention issues and increases data parallelism, making them a good fit for analytical workloads. Hence, the programmer is relieved of the overhead of maintaining structures such as indexes, caches, and partitions. Assessing the implementation of Information Gain on both of these parallel processing components provides insight into the benefits and downsides that each component offers and, more broadly, will enable researchers and developers to choose the appropriate component for a given task.
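To make the abstract's description of Information Gain concrete, the following is a minimal single-machine sketch in Python, not the authors' Hive or MapReduce implementation. It assumes categorical features and uses the standard definition IG(feature) = H(label) - H(label | feature). The function names (`map_phase`, `reduce_phase`, `information_gain`) and the toy rows are hypothetical, chosen only to mirror the count-then-aggregate shape that a MapReduce job or a Hive `GROUP BY` query would take.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def map_phase(rows, label_index):
    """Emit ((feature_index, feature_value, label), 1) for every feature cell."""
    for row in rows:
        label = row[label_index]
        for i, value in enumerate(row):
            if i != label_index:
                yield (i, value, label), 1


def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reducer would."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def information_gain(rows, label_index):
    """Return {feature_index: IG} for categorical rows (illustrative only)."""
    counts = reduce_phase(map_phase(rows, label_index))
    n = len(rows)
    base = entropy(Counter(row[label_index] for row in rows).values())

    # Regroup the reduced counts: feature -> value -> label counts.
    per_value = defaultdict(lambda: defaultdict(Counter))
    for (feature, value, label), c in counts.items():
        per_value[feature][value][label] += c

    gains = {}
    for feature, values in per_value.items():
        # Conditional entropy: weighted entropy of labels within each value.
        conditional = sum(
            (sum(label_counts.values()) / n) * entropy(label_counts.values())
            for label_counts in values.values()
        )
        gains[feature] = base - conditional
    return gains


# Toy data (hypothetical): column 0 determines the label, column 1 does not.
rows = [
    ("a", "x", "yes"),
    ("a", "y", "yes"),
    ("b", "x", "no"),
    ("b", "y", "no"),
]
gains = information_gain(rows, label_index=2)
```

In this toy dataset the first column fully determines the label and the second is independent of it, so a feature selection step would keep the first and discard the second. In a distributed setting only the counting step would need to run across machines; the entropy arithmetic operates on the much smaller table of aggregated counts.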