Beyond Hit-or-Miss: A Comparative Study of Synopses for Similarity Searching

J. Inf. Data Manag. Pub Date: 2018-06-20. DOI: 10.5753/jidm.2018.1635

M. Bedo, Daniel de Oliveira, A. Traina, C. Traina


Citations: 1

Abstract

A DBMS optimizer module makes its decisions by modeling query costs on the distribution of the data space. Cost modeling of similarity queries, however, requires a representation of the distance distribution rather than the data distribution. Finding a suitable representation (or synopsis) of the distance distribution therefore has a major impact on the optimization of similarity searches. In this study, we evaluate the quality of the estimates drawn from five synopses of distinct paradigms with regard to two common query criteria. Moreover, we embed the synopses into a new parametric cost model, called Stockpile, for estimating the cost of similarity queries on metric trees. The model uses the synopsis estimates to calculate the probability of traversing a metric tree node, which defines the expected number of both disk accesses (I/O costs) and distance calculations (CPU costs). We performed an extensive set of experiments on real-world data sources, comparing the estimates of each synopsis (and its parametric variations) by means of paired ranking tests. Overall, three synopses outperformed their competitors in selectivity estimation, and two of those three also surpassed the others in predicting both I/O and CPU costs under the Stockpile model. The results also indicate that the choice of the most suitable synopsis may depend on characteristics of the distance distribution.
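To make the idea of a distance-distribution synopsis concrete, the following is a minimal sketch, not the Stockpile model itself, whose formulas the abstract does not give. It assumes an equi-width histogram as the synopsis (the abstract does not name the five synopses it compares) and assumes that a metric tree node with covering radius `r_node` is visited by a range query of radius `r_query` with probability `F(r_node + r_query)`, where `F` is the estimated distance CDF. That access rule is a common simplification in metric-tree cost models and is an assumption here. All function names and the `(covering_radius, num_entries)` node layout are hypothetical.

```python
def build_histogram(distances, bins, d_max):
    """Equi-width histogram synopsis over a sample of pairwise distances."""
    counts = [0] * bins
    width = d_max / bins
    for d in distances:
        # Clamp so that d == d_max falls into the last bin.
        idx = min(int(d / width), bins - 1)
        counts[idx] += 1
    return counts, width


def cdf(hist, x):
    """Estimate F(x) = P(distance <= x), interpolating linearly inside a bin."""
    counts, width = hist
    if x <= 0:
        return 0.0
    full = int(x / width)          # number of bins lying entirely below x
    if full >= len(counts):
        return 1.0
    partial = counts[full] * (x / width - full)
    return (sum(counts[:full]) + partial) / sum(counts)


def range_selectivity(hist, radius, n_objects):
    """Expected number of objects returned by a range query (assumed model)."""
    return n_objects * cdf(hist, radius)


def expected_costs(hist, nodes, radius):
    """Expected I/O and CPU cost of a range query.

    nodes: iterable of (covering_radius, num_entries) pairs (hypothetical layout).
    Assumes a node is visited with probability F(covering_radius + radius).
    """
    io_cost = 0.0
    cpu_cost = 0.0
    for covering_radius, num_entries in nodes:
        p_visit = cdf(hist, covering_radius + radius)
        io_cost += p_visit                 # one disk access per visited node
        cpu_cost += p_visit * num_entries  # one distance computation per entry
    return io_cost, cpu_cost


if __name__ == "__main__":
    hist = build_histogram([0.5, 1.5, 2.5, 3.5], bins=4, d_max=4.0)
    print(range_selectivity(hist, 2.0, n_objects=1000))
    print(expected_costs(hist, [(1.0, 10), (3.0, 20)], radius=1.0))
```

Swapping `build_histogram` and `cdf` for another synopsis leaves `expected_costs` unchanged, which is the sense in which a cost model can be parametric in the synopsis.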

