A three-dimensional data model in HBase for large time-series dataset analysis

2012 IEEE 6th International Workshop on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA). Pub Date: 2012-12-24. DOI: 10.1109/MESOCA.2012.6392598

Dan Han, Eleni Stroulia


Citations: 33

Abstract

As applications move from traditional enterprise infrastructures to cloud infrastructures, scalable database management systems play an important role in efficiently managing and analysing unprecedentedly massive amounts of data. Compared to RDBMSs, NoSQL databases are more attractive for addressing this challenge. However, it is not easy for non-expert users to manage data in a NoSQL database effectively, because support for data organization is scarce. A poor data organization may inadvertently misuse the features of the NoSQL database and yield unsatisfactory performance. Therefore, a systematic method for NoSQL database data-schema design is a timely and important problem for researchers and practitioners. HBase, as a particular NoSQL database offering, relies (a) on HDFS, for its distributed and replicated storage, and (b) on coprocessors, for efficient parallel query processing. To harness the potential benefits of parallelism, the data must be partitioned appropriately across the HBase storage. We investigate the effectiveness of the three-dimensional data model, which uses the “version” dimension of HBase to store the values of a data item over time. We have evaluated experimentally the performance impact of this type of data model with two data sets, of different sizes and different time lengths. For each of these data sets, we have compared the performance of several ad-hoc queries, implemented with the HBase Coprocessors framework, across different data schemas, some of which use the third HBase dimension and some of which do not. The experimental results demonstrate improved performance with the data schemas that use the third dimension of HBase.
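The idea of storing a time series in the “version” dimension can be illustrated with a small sketch. The code below is not the HBase client API and not the schema from the paper; it is a toy in-memory model of HBase's logical layout (row, column, timestamp), written under stated assumptions. The class `VersionedTable`, the row key `sensor-1`, the column `m:temp`, and the sample readings are all hypothetical names chosen for illustration. It contrasts a three-dimensional layout (one row per data item, one version per time point) with a two-dimensional layout that encodes the timestamp in the row key.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory stand-in for HBase's logical model:
    row key -> column -> {timestamp: value}. Not the HBase API."""

    def __init__(self):
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Each (row, column) cell keeps one value per timestamp (its "versions").
        self._cells[row][column][timestamp] = value

    def get_versions(self, row, column, start=None, end=None):
        """Return (timestamp, value) pairs with start <= ts < end, newest first."""
        versions = self._cells.get(row, {}).get(column, {})
        selected = [
            (ts, value)
            for ts, value in versions.items()
            if (start is None or ts >= start) and (end is None or ts < end)
        ]
        return sorted(selected, reverse=True)

    def row_count(self):
        return len(self._cells)


# Hypothetical sample data: (timestamp, temperature) readings for one sensor.
readings = [(1000, 20.5), (2000, 21.0), (3000, 21.7)]

# Three-dimensional layout: one row per sensor, time lives in the version dimension.
three_d = VersionedTable()
for ts, temp in readings:
    three_d.put("sensor-1", "m:temp", ts, temp)

# Two-dimensional layout: the timestamp is folded into the row key instead.
two_d = VersionedTable()
for ts, temp in readings:
    two_d.put(f"sensor-1#{ts}", "m:temp", 0, temp)

# A time-window query on the 3D layout touches a single row.
window = three_d.get_versions("sensor-1", "m:temp", start=1000, end=3000)
mean_temp = sum(value for _, value in window) / len(window)
```

In this sketch the three-dimensional table holds a single row for the sensor while the two-dimensional table holds one row per reading, so a time-window query reads one row in the first case and a range of rows in the second. The sketch says nothing about performance; the measured comparison is the subject of the paper's experiments.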



