BASS: approximate search on large string databases
Jiong Yang, Wei Wang, Philip S. Yu
Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004. Published 2004-06-21. DOI: 10.1109/SSDBM.2004.20 (https://doi.org/10.1109/SSDBM.2004.20)
In this paper, we study the problem of how to build an index structure for large string databases that efficiently supports various types of string matching without requiring the substrings to be mapped to a numerical space (as in the string B-tree and MRS-index) and without being restricted to in-memory operation (as with the suffix tree and suffix array). Towards this goal, we propose a new indexing scheme, the BASS-tree, which efficiently supports general approximate substring matching (in terms of certain symbol substitutions and misalignments) in sublinear time on a large string database. The key idea behind the design is that all positions in each string are grouped recursively into a fully balanced tree according to the similarities of the subsequent segments starting at those positions. Each node is labeled with a regular expression that describes the commonality of the substrings indexed through its subtree. Any search can then be quickly directed to the portion of the database with a high potential of matching. With the BASS-tree in place, wildcards in the query pattern can also be handled seamlessly. In addition, the search for a long pattern can be decomposed into a series of searches for short segments, followed by a step that joins the results. Our experiments demonstrate that the BASS-tree can improve performance by an order of magnitude over alternative methods.
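The decomposition of a long pattern into short segments followed by a join can be illustrated with a minimal Python sketch. This is not the BASS-tree: the abstract gives no construction details, so each segment is located here with a naive linear scan where an index lookup would go. The function names (`hamming`, `segment_hits`, `approx_search`) and the parameters `max_mismatches` and `seg_len` are hypothetical, and only symbol substitutions are modeled (Hamming distance), not misalignments or wildcards.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def segment_hits(text, segment, max_mismatches):
    """Start positions where `segment` occurs in `text` within the budget.

    Naive linear scan; stands in for a per-segment index lookup.
    """
    m = len(segment)
    return {i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], segment) <= max_mismatches}


def approx_search(text, pattern, max_mismatches, seg_len):
    """Find starts of substrings within `max_mismatches` substitutions of `pattern`."""
    if not pattern or seg_len <= 0:
        raise ValueError("pattern must be non-empty and seg_len positive")

    candidates = None
    for off in range(0, len(pattern), seg_len):
        seg = pattern[off:off + seg_len]
        # A full match with at most k mismatches has at most k mismatches in
        # every segment, so each segment is searched with the whole budget.
        starts = {pos - off for pos in segment_hits(text, seg, max_mismatches)}
        # Join step: keep only pattern starts that every segment agrees on.
        candidates = starts if candidates is None else candidates & starts

    m = len(pattern)
    # Verification: the per-segment filter admits false positives, so the
    # total mismatch count is checked on each surviving candidate.
    return sorted(p for p in candidates
                  if 0 <= p <= len(text) - m
                  and hamming(text[p:p + m], pattern) <= max_mismatches)


if __name__ == "__main__":
    print(approx_search("xxabcdefxx", "abXdef", max_mismatches=1, seg_len=3))
```

Searching each segment with the full mismatch budget keeps the filter lossless, since no true match can be dropped before the join; the final verification then removes candidates whose segments match individually but exceed the budget in total.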