Beyond simple aggregates: indexing for summary queries

Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems Pub Date : 2011-06-13 DOI:10.1145/1989284.1989299

Zhewei Wei, K. Yi

{"title":"Beyond simple aggregates: indexing for summary queries","authors":"Zhewei Wei, K. Yi","doi":"10.1145/1989284.1989299","DOIUrl":null,"url":null,"abstract":"Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records, but some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index.\n However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.","PeriodicalId":92118,"journal":{"name":"Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems","volume":"37 1","pages":"117-128"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1989284.1989299","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 15

Abstract

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records, but some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.

查看原文本刊更多论文

除了简单的聚合之外:为摘要查询建立索引

数据库查询可以大致分为两类:报告查询和聚合查询。前者从数据库中检索符合查询条件的记录集合，而后者返回这些记录的特定属性的集合，如count、sum、average或max (min)。聚合查询在商业智能和数据分析应用程序中特别有用，在这些应用程序中，用户感兴趣的不是实际记录，而是它们的一些统计数据。通过将预先计算好的聚合嵌入到索引中，它们的执行效率也比报告查询高得多。然而，报告和聚合查询只提供了两种极端的数据探索方式。数据分析师通常需要更深入地了解数据分布，而不是那些简单的聚合所提供的，当然也不希望通过报告查询返回大量的数据。在本文中，我们设计了索引技术，允许提取查询中所有记录的统计摘要。我们支持的摘要包括频繁项、分位数、各种草图和小波，所有这些在海量数据分析中都是至关重要的。我们的索引需要线性空间，并以最优或接近最优的查询成本提取摘要。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems

CiteScore

4.40

自引率

0.00%

发文量