A three-dimensional data model in HBase for large time-series dataset analysis
Dan Han, Eleni Stroulia
2012 IEEE 6th International Workshop on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA)
Published: 2012-12-24
DOI: 10.1109/MESOCA.2012.6392598 (https://doi.org/10.1109/MESOCA.2012.6392598)
Citations: 33
Abstract
As applications move from traditional enterprise infrastructures to the cloud, scalable database management systems play an important role in efficiently managing and analysing unprecedented amounts of data. Compared to RDBMSs, NoSQL databases are more attractive for addressing this challenge. However, non-expert users find it hard to manage data in a NoSQL database effectively because these systems offer little data-organization support. A poor data organization may inadvertently misuse the features of the NoSQL database and yield unsatisfactory performance. A systematic method for NoSQL data-schema design is therefore a timely and important problem for researchers and practitioners. HBase, a particular NoSQL database offering, relies (a) on HDFS for its distributed and replicated storage and (b) on coprocessors for efficient parallel query processing. To harness the potential benefits of parallelism, the data must be partitioned appropriately across the HBase storage. We investigate the effectiveness of the three-dimensional data model, which uses the “version” dimension of HBase to store the values of a data item over time. We evaluated the performance impact of this type of data model experimentally with two data sets of different sizes and different time lengths. For each data set, we compared the performance of several ad-hoc queries, implemented with the HBase Coprocessors framework, across different data schemas, some of which use the third HBase dimension and some of which do not. The experimental results demonstrate improved performance with the data schemas that use the third dimension of HBase.
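
The idea of storing a data item's history in the "version" dimension can be illustrated with a small in-memory sketch. This is not the authors' implementation and does not use the HBase client API; the class `ToyVersionedTable`, the row key `sensor-42`, the column `d:temperature`, and the readings are all hypothetical. It only assumes that a cell is addressed by (row, column, timestamp) and that a time-range read is half-open and returns the newest version first.

```python
from collections import defaultdict


class ToyVersionedTable:
    """Hypothetical in-memory stand-in for a table whose cells keep timestamped versions."""

    def __init__(self):
        # row key -> column -> {timestamp: value}; the timestamp is the third dimension
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store one value of a data item as a version of a single cell."""
        self._rows[row][column][timestamp] = value

    def get_versions(self, row, column, t_start, t_end):
        """Return (timestamp, value) pairs with t_start <= timestamp < t_end, newest first."""
        cell = self._rows[row][column]
        return sorted(
            ((ts, v) for ts, v in cell.items() if t_start <= ts < t_end),
            reverse=True,
        )


# One row and one column hold the whole series; time lives in the versions.
table = ToyVersionedTable()
for ts, reading in [(1, 20.5), (2, 21.0), (3, 21.4), (4, 22.1)]:
    table.put("sensor-42", "d:temperature", ts, reading)

# An ad-hoc query over a time window touches a single cell instead of many rows.
window = table.get_versions("sensor-42", "d:temperature", 2, 4)
average = sum(v for _, v in window) / len(window)
```

In a schema that does not use the third dimension, the timestamp would typically be folded into the row key or column name, so the same window query would span several rows or columns rather than the versions of one cell.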