{"title":"基于分析工作负载信息增益算法设计的mapreduce与hive的比较研究","authors":"S. Bagui, Sharon K. John, John P. Baggs","doi":"10.1145/3190645.3190705","DOIUrl":null,"url":null,"abstract":"Information Gain (IG) or the Kullback Leibler algorithm is a statistical algorithm that is employed to extract useful features from datasets to eliminate redundant and valueless features. Applying this feature selection technique paves way for sophisticated analysis on Big Data, requiring the underlying framework to handle the data complexity, volume and velocity. The Hadoop ecosystem comes in handy, enabling for seamless distributed computing leveraging the computing potential of many commercial machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's capability to analyze how it suits analytical algorithms and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study will showcase the efficacy in designing IG for Hadoop framework and discuss the implementation of IG for analytical workload on Hive and MapReduce. Inherently both these components are built over a shared nothing architecture which prevents contention issues increasing data parallelism, thus best-fitting for analytical workloads. Hence, the programmer is relieved from the overhead of maintaining structures like indexes, caches and partitions. Assessing implementation of Information Gain on both these parallel processing components will certainly provide insights on the benefits and downsides that each component should offer and at large will enable researchers and developers to employ appropriate components for suitable tasks.","PeriodicalId":403177,"journal":{"name":"Proceedings of the ACMSE 2018 Conference","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A comparative study of mapreduce and hive based on the design of the information gain algorithm for analytical workloads\",\"authors\":\"S. Bagui, Sharon K. John, John P. Baggs\",\"doi\":\"10.1145/3190645.3190705\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Information Gain (IG) or the Kullback Leibler algorithm is a statistical algorithm that is employed to extract useful features from datasets to eliminate redundant and valueless features. Applying this feature selection technique paves way for sophisticated analysis on Big Data, requiring the underlying framework to handle the data complexity, volume and velocity. The Hadoop ecosystem comes in handy, enabling for seamless distributed computing leveraging the computing potential of many commercial machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's capability to analyze how it suits analytical algorithms and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study will showcase the efficacy in designing IG for Hadoop framework and discuss the implementation of IG for analytical workload on Hive and MapReduce. Inherently both these components are built over a shared nothing architecture which prevents contention issues increasing data parallelism, thus best-fitting for analytical workloads. Hence, the programmer is relieved from the overhead of maintaining structures like indexes, caches and partitions. Assessing implementation of Information Gain on both these parallel processing components will certainly provide insights on the benefits and downsides that each component should offer and at large will enable researchers and developers to employ appropriate components for suitable tasks.\",\"PeriodicalId\":403177,\"journal\":{\"name\":\"Proceedings of the ACMSE 2018 Conference\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACMSE 2018 Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3190645.3190705\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACMSE 2018 Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190645.3190705","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
信息增益(Information Gain, IG)或Kullback Leibler算法是一种从数据集中提取有用特征以消除冗余和无价值特征的统计算法。这种特征选择技术的应用为大数据的复杂分析铺平了道路,需要底层框架来处理数据的复杂性、数量和速度。Hadoop生态系统可以派上用场,利用许多商用机器的计算潜力,实现无缝的分布式计算。先前的研究[1,2]表明Hive最适合数据仓库和ETL (Extract, Transform, Load)工作负载。我们的目标是扩展Hive的能力,分析它如何适合分析算法,并将其性能与MapReduce进行比较。在这个大数据时代,有效地设计算法以获得并行化优于现有框架的好处是至关重要的。本研究将展示IG在Hadoop框架设计中的有效性,并讨论IG在Hive和MapReduce上分析工作负载的实现。从本质上讲,这两个组件都构建在一个无共享架构之上,该架构可以防止争用问题增加数据并行性,因此最适合分析工作负载。因此,程序员从维护索引、缓存和分区等结构的开销中解脱出来。在这两个并行处理组件上评估Information Gain的实现肯定会提供每个组件应该提供的优点和缺点的见解,并且总体上使研究人员和开发人员能够为合适的任务使用适当的组件。
A comparative study of mapreduce and hive based on the design of the information gain algorithm for analytical workloads
Information Gain (IG) or the Kullback Leibler algorithm is a statistical algorithm that is employed to extract useful features from datasets to eliminate redundant and valueless features. Applying this feature selection technique paves way for sophisticated analysis on Big Data, requiring the underlying framework to handle the data complexity, volume and velocity. The Hadoop ecosystem comes in handy, enabling for seamless distributed computing leveraging the computing potential of many commercial machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's capability to analyze how it suits analytical algorithms and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study will showcase the efficacy in designing IG for Hadoop framework and discuss the implementation of IG for analytical workload on Hive and MapReduce. Inherently both these components are built over a shared nothing architecture which prevents contention issues increasing data parallelism, thus best-fitting for analytical workloads. Hence, the programmer is relieved from the overhead of maintaining structures like indexes, caches and partitions. Assessing implementation of Information Gain on both these parallel processing components will certainly provide insights on the benefits and downsides that each component should offer and at large will enable researchers and developers to employ appropriate components for suitable tasks.