{"title":"A comparative study of mapreduce and hive based on the design of the information gain algorithm for analytical workloads","authors":"S. Bagui, Sharon K. John, John P. Baggs","doi":"10.1145/3190645.3190705","DOIUrl":null,"url":null,"abstract":"Information Gain (IG) or the Kullback Leibler algorithm is a statistical algorithm that is employed to extract useful features from datasets to eliminate redundant and valueless features. Applying this feature selection technique paves way for sophisticated analysis on Big Data, requiring the underlying framework to handle the data complexity, volume and velocity. The Hadoop ecosystem comes in handy, enabling for seamless distributed computing leveraging the computing potential of many commercial machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to extend Hive's capability to analyze how it suits analytical algorithms and compare its performance with MapReduce. In this Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study will showcase the efficacy in designing IG for Hadoop framework and discuss the implementation of IG for analytical workload on Hive and MapReduce. Inherently both these components are built over a shared nothing architecture which prevents contention issues increasing data parallelism, thus best-fitting for analytical workloads. Hence, the programmer is relieved from the overhead of maintaining structures like indexes, caches and partitions. Assessing implementation of Information Gain on both these parallel processing components will certainly provide insights on the benefits and downsides that each component should offer and at large will enable researchers and developers to employ appropriate components for suitable tasks.","PeriodicalId":403177,"journal":{"name":"Proceedings of the ACMSE 2018 Conference","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACMSE 2018 Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190645.3190705","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Information Gain (IG), also referred to as the Kullback-Leibler algorithm, is a statistical technique employed to extract useful features from datasets and eliminate redundant or valueless ones. Applying this feature selection technique paves the way for sophisticated analysis of Big Data, requiring the underlying framework to handle the data's complexity, volume, and velocity. The Hadoop ecosystem fills this role, enabling seamless distributed computing that leverages the computing potential of many commodity machines. Previous research studies [1, 2] indicate that Hive is best suited for data warehousing and ETL (Extract, Transform, Load) workloads. We aim to go beyond this view, analyze how well Hive suits analytical algorithms, and compare its performance with MapReduce. In the Big Data era, it is essential to design algorithms efficiently to reap the benefits of parallelization over existing frameworks. This study showcases the design of IG for the Hadoop framework and discusses the implementation of IG as an analytical workload on Hive and MapReduce. Both components are inherently built on a shared-nothing architecture, which prevents contention issues and increases data parallelism, making them well suited for analytical workloads. The programmer is also relieved of the overhead of maintaining structures such as indexes, caches, and partitions. Assessing the implementation of Information Gain on both of these parallel processing components provides insight into the benefits and downsides each component offers and, at large, will enable researchers and developers to employ the appropriate component for a given task.
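The abstract does not spell out the exact formulation used in the paper; as a point of reference, the standard entropy-based definition of Information Gain for a feature X with respect to a class variable C, and its equivalent Kullback-Leibler (mutual information) form, is:

\[
IG(C; X) = H(C) - H(C \mid X), \quad \text{where } H(C) = -\sum_{c} P(c)\,\log_2 P(c), \quad H(C \mid X) = -\sum_{x} P(x) \sum_{c} P(c \mid x)\,\log_2 P(c \mid x),
\]
\[
IG(C; X) = \sum_{x}\sum_{c} P(c, x)\,\log_2 \frac{P(c, x)}{P(c)\,P(x)}.
\]

In a distributed implementation, the per-feature counts needed to estimate P(c), P(x), and P(c, x) are the natural quantities to aggregate across nodes, whether expressed as MapReduce jobs or as Hive group-by queries.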