Effective stratification for low selectivity queries on deep web data sources

Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management Pub Date : 2011-10-24 DOI:10.1145/2063576.2063786

Fan Wang, G. Agrawal

{"title":"Effective stratification for low selectivity queries on deep web data sources","authors":"Fan Wang, G. Agrawal","doi":"10.1145/2063576.2063786","DOIUrl":null,"url":null,"abstract":"We study the problem of estimating the result of an aggregation query with low selectivity when a data source only supports limited data accesses. Existing stratified sampling techniques cannot be applied to such a problem since either it is very hard, if not impossible, to gather certain critical statistics from such a data source, or more importantly, the selective attribute of the query may not be queriable on the data source. In such cases, we need an effective mechanism to stratify the data and form homogeneous strata with respect to the selective attribute of the query, despite not being able to query the data source with the selective attribute.\n This paper presents and evaluates a stratification method for this problem utilizing a queriable auxiliary attribute. The breaking points for the stratification are computed based on a novel Bayesian Adaptive Harmony Search algorithm. This method derives from the existing Harmony search method, but includes novel objective function, and introduces a technique for dynamically adapting key parameters of this method. Our experiments show that the estimation accuracy achieved using our method is consistently higher than 95% even for 0.01% selectivity query, even when there is only a low correlation between the auxiliary attribute and the selective attribute. Furthermore, our method achieves at least a five fold reduction in estimation error over three other methods, for the same sampling cost.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"223 1","pages":"1455-1464"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2063576.2063786","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

Abstract

We study the problem of estimating the result of an aggregation query with low selectivity when a data source only supports limited data accesses. Existing stratified sampling techniques cannot be applied to such a problem since either it is very hard, if not impossible, to gather certain critical statistics from such a data source, or more importantly, the selective attribute of the query may not be queriable on the data source. In such cases, we need an effective mechanism to stratify the data and form homogeneous strata with respect to the selective attribute of the query, despite not being able to query the data source with the selective attribute. This paper presents and evaluates a stratification method for this problem utilizing a queriable auxiliary attribute. The breaking points for the stratification are computed based on a novel Bayesian Adaptive Harmony Search algorithm. This method derives from the existing Harmony search method, but includes novel objective function, and introduces a technique for dynamically adapting key parameters of this method. Our experiments show that the estimation accuracy achieved using our method is consistently higher than 95% even for 0.01% selectivity query, even when there is only a low correlation between the auxiliary attribute and the selective attribute. Furthermore, our method achieves at least a five fold reduction in estimation error over three other methods, for the same sampling cost.

查看原文本刊更多论文

深层网络数据源低选择性查询的有效分层

研究了当数据源只支持有限的数据访问时，低选择性聚合查询结果的估计问题。现有的分层抽样技术不能应用于这样的问题，因为要么很难(如果不是不可能的话)从这样的数据源收集某些关键统计数据，要么更重要的是，查询的选择性属性可能无法在数据源上查询。在这种情况下，我们需要一种有效的机制来对数据进行分层，并根据查询的选择性属性形成同质层，尽管无法查询具有选择性属性的数据源。本文提出并评价了一种利用可查询辅助属性的分层方法。基于一种新的贝叶斯自适应和谐搜索算法计算分层的断点。该方法在原有和声搜索方法的基础上，增加了新的目标函数，并引入了一种关键参数的动态自适应技术。实验表明，即使在辅助属性和选择性属性之间只有低相关性的情况下，对于0.01%的选择性查询，使用该方法获得的估计精度也始终高于95%。此外，对于相同的采样成本，我们的方法比其他三种方法至少减少了五倍的估计误差。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management

自引率

0.00%

发文量