用于查询和估计的近似数据流:算法和性能评估

Proceedings 18th International Conference on Data Engineering Pub Date : 2002-08-07 DOI:10.1109/ICDE.2002.994775

S. Guha, Nick Koudas

{"title":"用于查询和估计的近似数据流:算法和性能评估","authors":"S. Guha, Nick Koudas","doi":"10.1109/ICDE.2002.994775","DOIUrl":null,"url":null,"abstract":"Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"109","resultStr":"{\"title\":\"Approximating a data stream for querying and estimation: algorithms and performance evaluation\",\"authors\":\"S. Guha, Nick Koudas\",\"doi\":\"10.1109/ICDE.2002.994775\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal.\",\"PeriodicalId\":191529,\"journal\":{\"name\":\"Proceedings 18th International Conference on Data Engineering\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2002-08-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"109\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings 18th International Conference on Data Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2002.994775\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 18th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2002.994775","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 109

摘要

获取快速、高质量的数据分布近似值是数据库管理的核心问题。各种流行的数据库应用程序，包括大多数应用领域中的近似查询、相似度搜索和数据挖掘，都依赖于这种高质量的近似。在数据库理论和实践中，基于直方图的近似是一种非常流行的方法，它以一种节省空间的方式简洁地表示数据分布。在本文中，我们把直方图构造问题的角度，我们提出了一个有限的数据集和/或已知的数据集大小的要求，我们推广它。我们考虑一个无限数据集的情况，其中数据连续到达，形成无限数据流。在这种情况下，我们提出了能够构建可证明的高质量直方图的单遍算法。我们提出了用于基本直方图构建问题的固定窗口变体的算法，支持直方图的增量维护。所提出的算法以精度换取速度，并允许基于应用程序需求在两者之间进行适当的权衡。在无限数据流近似查询的情况下，我们提出了一个详细的实验评估，将我们的算法与使用真实数据集的其他适用技术进行比较，证明了我们建议的优越性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Approximating a data stream for querying and estimation: algorithms and performance evaluation

Obtaining fast and good-quality approximations to data distributions is a problem of central interest to database management. A variety of popular database applications, including approximate querying, similarity searching and data mining in most application domains, rely on such good-quality approximations. Histogram-based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space-efficient manner. In this paper, we place the problem of histogram construction into perspective and we generalize it by raising the requirement of a finite data set and/or known data set size. We consider the case of an infinite data set in which data arrive continuously, forming an infinite data stream. In this context, we present single-pass algorithms that are capable of constructing histograms of provable good quality. We present algorithms for the fixed-window variant of the basic histogram construction problem, supporting incremental maintenance of the histograms. The proposed algorithms trade accuracy for speed and allow for a graceful tradeoff between the two, based on application requirements. In the case of approximate queries on infinite data streams, we present a detailed experimental evaluation comparing our algorithms with other applicable techniques using real data sets, demonstrating the superiority of our proposal.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings 18th International Conference on Data Engineering

自引率

0.00%

发文量