A novel approach for approximate aggregations over arrays

Proceedings of the 27th International Conference on Scientific and Statistical Database Management Pub Date : 2015-06-29 DOI:10.1145/2791347.2791349

Yi Wang, Yunde Su, G. Agrawal

{"title":"A novel approach for approximate aggregations over arrays","authors":"Yi Wang, Yunde Su, G. Agrawal","doi":"10.1145/2791347.2791349","DOIUrl":null,"url":null,"abstract":"Approximate aggregation has been a popular approach for interactive data analysis and decision making, especially on large-scale datasets. While there is clearly a need to apply this approach for scientific datasets comprising massive arrays, existing algorithms have largely been developed for relational data, and cannot handle both dimension-based and value-based predicates efficiently while maintaining accuracy. In this paper, we present a novel approach for approximate aggregations over array data, using bitmap indices or bitvectors as the summary structure, as they preserve both spatial and value distribution of the data. We develop approximate aggregation algorithms using only the bitvectors and certain additional pre-aggregation statistics (equivalent to a 1-dimensional histogram) that we require. Another key development is choosing a binning strategy that can improve aggregation accuracy -- we introduce a v-optimized binning strategy and its weighted extension, and present a bitmap construction algorithm with such binning. We compare our method with other existing methods including sampling and multi-dimensional histograms, as well as the use of other binning strategies with bitmaps. We demonstrate both high accuracy and efficiency of our approach. Specifically, we show that in most cases, our method is more accurate than other methods by at least one order of magnitude. Despite achieving much higher accuracy, our method can require significantly less storage than multi-dimensional histograms.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"89","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2791347.2791349","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 89

Abstract

Approximate aggregation has been a popular approach for interactive data analysis and decision making, especially on large-scale datasets. While there is clearly a need to apply this approach for scientific datasets comprising massive arrays, existing algorithms have largely been developed for relational data, and cannot handle both dimension-based and value-based predicates efficiently while maintaining accuracy. In this paper, we present a novel approach for approximate aggregations over array data, using bitmap indices or bitvectors as the summary structure, as they preserve both spatial and value distribution of the data. We develop approximate aggregation algorithms using only the bitvectors and certain additional pre-aggregation statistics (equivalent to a 1-dimensional histogram) that we require. Another key development is choosing a binning strategy that can improve aggregation accuracy -- we introduce a v-optimized binning strategy and its weighted extension, and present a bitmap construction algorithm with such binning. We compare our method with other existing methods including sampling and multi-dimensional histograms, as well as the use of other binning strategies with bitmaps. We demonstrate both high accuracy and efficiency of our approach. Specifically, we show that in most cases, our method is more accurate than other methods by at least one order of magnitude. Despite achieving much higher accuracy, our method can require significantly less storage than multi-dimensional histograms.

查看原文本刊更多论文

阵列近似聚合的一种新方法

近似聚合已成为交互式数据分析和决策的一种流行方法，特别是在大规模数据集上。虽然显然有必要将这种方法应用于包含大量数组的科学数据集，但现有的算法主要是为关系数据开发的，不能在保持准确性的同时有效地处理基于维和基于值的谓词。在本文中，我们提出了一种新的阵列数据近似聚合方法，使用位图索引或位向量作为汇总结构，因为它们同时保留了数据的空间分布和值分布。我们只使用我们需要的位向量和某些额外的预聚合统计(相当于一维直方图)来开发近似聚合算法。另一个关键的发展是选择一种可以提高聚合精度的分箱策略——我们引入了一种v优化的分箱策略及其加权扩展，并提出了一种使用这种分箱的位图构建算法。我们将我们的方法与其他现有的方法进行了比较，包括采样和多维直方图，以及其他使用位图的分箱策略。我们证明了我们的方法具有很高的准确性和效率。具体来说，我们表明，在大多数情况下，我们的方法比其他方法至少准确一个数量级。尽管实现了更高的精度，但我们的方法比多维直方图需要的存储空间要少得多。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 27th International Conference on Scientific and Statistical Database Management

自引率

0.00%

发文量