Beyond Hit-or-Miss: A Comparative Study of Synopses for Similarity Searching

M. Bedo, Daniel de Oliveira, A. Traina, C. Traina
{"title":"Beyond Hit-or-Miss: A Comparative Study of Synopses for Similarity Searching","authors":"M. Bedo, Daniel de Oliveira, A. Traina, C. Traina","doi":"10.5753/jidm.2018.1635","DOIUrl":null,"url":null,"abstract":"A DBMS optimizer module takes its decisions by modeling the query costs upon the distribution of the data space. Cost modeling of similarity queries, however, requires the representation of distances’ rather than data distributions. Therefore, the finding of a suitable representation (or synopsis) for the distance distribution has a major impact in the optimization of similarity searches. In this study, we evaluate the quality of estimates drawn from five synopses of distinct paradigms regarding two common query criteria. Moreover, we embed the synopses into a new parametric cost model, called Stockpile, for the cost estimation of similarity queries on metric trees. The model uses the synopses estimation for calculating the probability of traversing a metric tree node, which defines the expected number of both disk accesses (I/O costs) and distance calculations (CPU costs). We performed an extensive set of experiments on real-world data sources regarding the estimates of each synopsis (and its parametric variations) by using paired ranking tests. In global terms, three synopses have outperformed their competitors regarding selectivity estimation, whereas two of them have also surpassed the others in the prediction of both I/O and CPU costs with respect to Stockpile model predictions. Additionally, results also indicate the choice of the most suitable synopsis may depend on characteristics of the distance distribution.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"42 8","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Inf. Data Manag.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/jidm.2018.1635","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

A DBMS optimizer module takes its decisions by modeling the query costs upon the distribution of the data space. Cost modeling of similarity queries, however, requires the representation of distances’ rather than data distributions. Therefore, the finding of a suitable representation (or synopsis) for the distance distribution has a major impact in the optimization of similarity searches. In this study, we evaluate the quality of estimates drawn from five synopses of distinct paradigms regarding two common query criteria. Moreover, we embed the synopses into a new parametric cost model, called Stockpile, for the cost estimation of similarity queries on metric trees. The model uses the synopses estimation for calculating the probability of traversing a metric tree node, which defines the expected number of both disk accesses (I/O costs) and distance calculations (CPU costs). We performed an extensive set of experiments on real-world data sources regarding the estimates of each synopsis (and its parametric variations) by using paired ranking tests. In global terms, three synopses have outperformed their competitors regarding selectivity estimation, whereas two of them have also surpassed the others in the prediction of both I/O and CPU costs with respect to Stockpile model predictions. Additionally, results also indicate the choice of the most suitable synopsis may depend on characteristics of the distance distribution.
超越偶然性:相似检索概要的比较研究
DBMS优化器模块根据数据空间的分布对查询成本进行建模,从而做出决策。然而,相似查询的成本建模需要表示距离而不是数据分布。因此,寻找合适的距离分布表示(或概要)对相似性搜索的优化具有重要影响。在这项研究中,我们评估了从关于两个常见查询标准的不同范式的五个概要得出的估计的质量。此外,我们将概要嵌入到一个新的参数成本模型中,称为库存,用于度量树上相似性查询的成本估计。该模型使用概要估计来计算遍历度量树节点的概率,该节点定义了磁盘访问(I/O成本)和距离计算(CPU成本)的预期次数。我们在真实世界的数据源上进行了一组广泛的实验,通过使用配对排序测试来估计每个概要(及其参数变化)。在全球范围内,三个概要在选择性估计方面优于其竞争对手,而其中两个在关于库存模型预测的I/O和CPU成本预测方面也优于其他概要。此外,结果还表明,选择最合适的概要可能取决于距离分布的特征。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信