BASS: approximate search on large string databases

Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004. Pub Date: 2004-06-21. DOI: 10.1109/SSDBM.2004.20

Jiong Yang, Wei Wang, Philip S. Yu


Citations: 2

Abstract

In this paper, we study the problem of how to build an index structure for large string databases that efficiently supports various types of string matching without either mapping the substrings to a numerical space (e.g., string B-tree and MRS-index) or being restricted to in-memory operation (e.g., suffix tree and suffix array). Towards this goal, we propose a new indexing scheme, the BASS-tree, which efficiently supports general approximate substring matching (in terms of certain symbol substitutions and misalignments) in sublinear time on a large string database. The key idea behind the design is that all positions in each string are grouped recursively into a fully balanced tree according to the similarities of the segments starting at those positions. Each node is labeled with a regular expression that describes the commonality of the substrings indexed through its subtree. Any search can then be directed quickly to the portion of the database with a high potential of matching. With the BASS-tree in place, wildcards in the query pattern can also be handled seamlessly. In addition, the search for a long pattern can be decomposed into a series of searches for short segments, followed by a process that joins the results. Our experiments demonstrate that the potential performance improvement brought by the BASS-tree is an order of magnitude over alternative methods.
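The decomposition of a long pattern into short segments, followed by a join of the per-segment results, can be illustrated with a minimal sketch. This is not the BASS-tree itself: a linear scan stands in for the index lookup, only symbol substitutions are counted (misalignments are ignored), and the names `segment_hits` and `search_long_pattern`, the `?` wildcard symbol, and the mismatch budget are all assumptions made for illustration.

```python
from typing import Dict, List


def segment_hits(text: str, segment: str, max_mismatches: int) -> Dict[int, int]:
    """Map each start position to its mismatch count, keeping those within budget.

    A linear scan stands in for an index lookup (assumption for illustration).
    """
    hits: Dict[int, int] = {}
    for start in range(len(text) - len(segment) + 1):
        window = text[start:start + len(segment)]
        # "?" is a hypothetical single-symbol wildcard that matches anything.
        mismatches = sum(1 for a, b in zip(window, segment) if b != "?" and a != b)
        if mismatches <= max_mismatches:
            hits[start] = mismatches
    return hits


def search_long_pattern(
    text: str, pattern: str, segment_len: int, max_mismatches: int
) -> List[int]:
    """Search short segments separately, then join the hits by their offsets."""
    if not pattern or segment_len <= 0:
        raise ValueError("pattern must be non-empty and segment_len positive")

    # One hit table per segment, tagged with the segment's offset in the pattern.
    per_segment = [
        (off, segment_hits(text, pattern[off:off + segment_len], max_mismatches))
        for off in range(0, len(pattern), segment_len)
    ]

    _, first_hits = per_segment[0]
    results: List[int] = []
    for start, total in sorted(first_hits.items()):
        for off, hits in per_segment[1:]:
            if start + off not in hits:
                break  # some later segment has no hit aligned with this start
            total += hits[start + off]
        else:
            # Every segment aligned; accept if the summed mismatches fit the budget.
            if total <= max_mismatches:
                results.append(start)
    return results


if __name__ == "__main__":
    print(search_long_pattern("ACGTACGTTCGA", "ACGTTCGA", 4, 1))
```

Because substitutions in the whole pattern are the sum of substitutions in its segments, searching each segment with the full budget loses no candidates, and the join step only has to check alignment and the summed count.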

