Automated Table Partitioner (ATAP) in Apache Hive

2018 4th International Conference on Computer and Information Sciences (ICCOINS) Pub Date : 2018-08-01 DOI:10.1109/ICCOINS.2018.8510580

Thivviyan Amirthalingam, H. Rais



Abstract

Big Data and Predictive Analytics have been a game-changing paradigm in academia and industry for the past decade, inspiring numerous efforts in multiple spaces. One of many such technologies is Hadoop, an open-source framework based on MapReduce for highly distributed and scalable solutions. As Hadoop became more popular, other technologies were built around it, turning it into an ecosystem in its own right. Currently, there are hundreds of tools and utilities that extend the Hadoop framework, and Apache Hive is one of the most prominent. Hive is built as a data warehousing layer that interacts with Hadoop and the underlying filesystem, HDFS. It quickly became the market leader in query processing because it provides a better user experience than MapReduce. Nevertheless, it imposes rigid structures that do not adapt to the ever-changing nature of data. This paper proposes a novel means of automating table partitioning in Hive. It includes a lexical analyzer that reads HiveQL queries and, in response, issues Data Definition Language (DDL) statements to restructure a table if a particular column is read more often than a user-set coefficient factor. Multiple experiments conducted for this research returned results that support the feasibility, adaptability, and usability of this proof of concept.
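The abstract does not give the analyzer's internals, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes that a column counts as "read" when it appears in a query's WHERE clause, and that the "coefficient factor" is the fraction of logged queries that must reference the column. The function names, the regex tokenizer, the `_part` table suffix, and the sample table are all hypothetical. The generated statements use standard Hive syntax (`PARTITIONED BY`, dynamic-partition `INSERT OVERWRITE`).

```python
import re
from collections import Counter

# Hypothetical tokenizer: quoted strings, identifiers, numbers, operators, punctuation.
TOKEN_RE = re.compile(r"'[^']*'|[A-Za-z_][A-Za-z0-9_]*|\d+|[<>=!]+|[^\sA-Za-z0-9_]")

# Keywords that end a WHERE clause in this simplified model.
CLAUSE_END = {"group", "order", "limit", "having", "cluster", "distribute", "sort"}


def where_columns(query, known_columns):
    """Return the known columns referenced in the query's WHERE clause."""
    tokens = [t.lower() for t in TOKEN_RE.findall(query)]
    if "where" not in tokens:
        return set()
    found = set()
    for tok in tokens[tokens.index("where") + 1:]:
        if tok in CLAUSE_END:
            break
        if tok in known_columns:  # string literals keep their quotes, so never match
            found.add(tok)
    return found


def suggest_partition_columns(queries, columns, threshold):
    """List columns filtered on in more than `threshold` (0..1) of the queries."""
    counts = Counter()
    for q in queries:
        counts.update(where_columns(q, columns))
    total = len(queries)
    return [c for c in columns if total and counts[c] / total > threshold]


def build_ddl(table, columns, partition_col):
    """Build Hive statements that copy `table` into a partitioned table."""
    new_table = f"{table}_part"  # assumed naming scheme
    data_cols = [(n, t) for n, t in columns.items() if n != partition_col]
    col_defs = ", ".join(f"{n} {t}" for n, t in data_cols)
    # Dynamic partitioning expects the partition column last in the SELECT list.
    select_list = ", ".join([n for n, _ in data_cols] + [partition_col])
    return [
        f"CREATE TABLE {new_table} ({col_defs}) "
        f"PARTITIONED BY ({partition_col} {columns[partition_col]})",
        "SET hive.exec.dynamic.partition.mode=nonstrict",
        f"INSERT OVERWRITE TABLE {new_table} PARTITION ({partition_col}) "
        f"SELECT {select_list} FROM {table}",
    ]


if __name__ == "__main__":
    # Hypothetical table schema and query log.
    schema = {"id": "INT", "amount": "DOUBLE", "region": "STRING"}
    log = [
        "SELECT id FROM sales WHERE region = 'east'",
        "SELECT amount FROM sales WHERE region = 'west' AND amount > 10",
        "SELECT * FROM sales WHERE id = 5",
        "SELECT region FROM sales",
    ]
    for col in suggest_partition_columns(log, schema, threshold=0.4):
        for stmt in build_ddl("sales", schema, col):
            print(stmt + ";")
```

A regex tokenizer of this kind ignores subqueries, joins, and column aliases. A real analyzer would need a proper HiveQL grammar, and would also have to weigh column cardinality before partitioning, since a high-cardinality partition column produces many small files in HDFS.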
