Asymmetric signature schemes for efficient exact edit similarity query processing

IF 2.2 · Zone 2 (Computer Science) · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems Pub Date : 2013-08-01 DOI:10.1145/2508020.2508023

Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, Haixun Wang


Citations: 19

Abstract

Given a query string Q, an edit similarity search finds all strings in a database whose edit distance to Q is no more than a given threshold τ. Most existing methods for answering edit similarity queries employ schemes to generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures. In this article, we show that for any such signature scheme, the lower bound of the minimum number of signatures is τ + 1, which is lower than what is achieved by existing methods. We then propose several asymmetric signature schemes, that is, extracting different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experimental results clearly showcase the efficiency of our methods and demonstrate the space and time characteristics of our proposed algorithms.
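The idea of matching q-chunks against q-grams can be illustrated with a minimal sketch. It is not the paper's IndexGram or IndexChunk algorithm: it scans the database linearly instead of using an index, it uses only full (unpadded) q-chunks, and all function names are hypothetical. It relies on one observation: the q-chunks of a data string are disjoint, so each edit operation can destroy at most one of them, and every surviving chunk must appear as a q-gram of the query. A data string with L full chunks can therefore be within distance τ of the query only if at least L − τ of its chunks occur among the query's q-grams.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def q_grams(s: str, q: int) -> set:
    """All overlapping substrings of length q (extracted from the query)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def q_chunks(s: str, q: int) -> list:
    """Disjoint substrings of length q starting at positions 0, q, 2q, ...
    Assumption of this sketch: a trailing partial chunk is dropped, not padded."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def passes_chunk_filter(data: str, query: str, q: int, tau: int) -> bool:
    """Necessary condition for edit_distance(data, query) <= tau."""
    chunks = q_chunks(data, q)
    grams = q_grams(query, q)
    matched = sum(1 for c in chunks if c in grams)
    return matched >= len(chunks) - tau


def search(db: list, query: str, q: int, tau: int) -> list:
    """Filter-and-verify edit similarity search (linear scan, no index)."""
    results = []
    for s in db:
        if abs(len(s) - len(query)) > tau:
            continue  # length filter: distance is at least the length difference
        if not passes_chunk_filter(s, query, q, tau):
            continue  # chunk/gram count filter
        if edit_distance(s, query) <= tau:
            results.append(s)  # verification step
    return results


if __name__ == "__main__":
    db = ["kitten", "sitting", "mitten", "bitten", "kitchen", "written", "database"]
    print(search(db, "kitten", q=2, tau=1))
```

The asymmetry is visible in the two extraction functions: a string of length n yields about n/q chunks but about n grams. The same counting argument suggests why τ + 1 signatures can suffice: if at most τ chunks can be destroyed, at least one of any τ + 1 chunks of a matching data string must appear in the query.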




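Source journal

ACM Transactions on Database Systems (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 5.60
Self-citation rate: 0.00%
Publication count: not listed
Review time: >12 weeks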

About the journal: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.