{"title":"Comparing HiveQL and MapReduce methods to process fact data in a data warehouse","authors":"Haince Denis Pen, Prajyoti Dsilva, Sweedle Mascarnes","doi":"10.1109/CSCITA.2017.8066553","DOIUrl":null,"url":null,"abstract":"Today Big data is one of the most widely spoken about technology that is being explored throughout the world by technology enthusiasts and academic researchers. The reason for this is the enormous data generated every second of each day. Every webpage visited, every text message sent, every post on social networking websites, check-in information, mouse clicks etc. is logged. This data needs to be stored and retrieved efficiently, moreover the data is unstructured therefore the traditional methods of strong data fail. This data needs to be stored and retrieved efficiently There is a need of an efficient, scalable and robust architecture that needs stores enormous amounts of unstructured data, which can be queried as and when required. In this paper, we come up with a novel methodology to build a data warehouse over big data technologies while specifically addressing the issues of scalability and user performance. Our emphasis is on building a data pipeline which can be used as a reference for future research on the methodologies to build a data warehouse over big data technologies for either structured or unstructured data sources. We have demonstrated the processing of data for retrieving the facts from data warehouse using two techniques, namely HiveQL and MapReduce.","PeriodicalId":299147,"journal":{"name":"2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA)","volume":"310 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CSCITA.2017.8066553","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Today, Big Data is one of the most widely discussed technologies, explored throughout the world by technology enthusiasts and academic researchers. The reason for this is the enormous amount of data generated every second of every day: every webpage visited, every text message sent, every post on social networking websites, check-in information, mouse clicks, etc. is logged. This data needs to be stored and retrieved efficiently; moreover, much of it is unstructured, so traditional methods of storing data fail. There is therefore a need for an efficient, scalable and robust architecture that can store enormous amounts of unstructured data and be queried as and when required. In this paper, we propose a novel methodology to build a data warehouse over big data technologies while specifically addressing the issues of scalability and user performance. Our emphasis is on building a data pipeline that can serve as a reference for future research on methodologies for building a data warehouse over big data technologies for either structured or unstructured data sources. We demonstrate the processing of data to retrieve facts from the data warehouse using two techniques, namely HiveQL and MapReduce.
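For illustration, the fact-retrieval step described above can be sketched in HiveQL; the fact table below (fact_sales and its columns) is a hypothetical example and is not taken from the paper. Hive typically compiles such a query into MapReduce jobs, which is the trade-off the paper compares against hand-written MapReduce code.

-- Hypothetical fact table; schema, names and location are illustrative only.
CREATE EXTERNAL TABLE IF NOT EXISTS fact_sales (
  product_id INT,
  store_id   INT,
  sale_date  STRING,
  quantity   INT,
  revenue    DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/warehouse/fact_sales';

-- Retrieve an aggregated fact per store; Hive translates this into MapReduce jobs.
SELECT store_id, SUM(revenue) AS total_revenue
FROM fact_sales
GROUP BY store_id;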