Answering approximate string queries on large data sets using external memory

2011 IEEE 27th International Conference on Data Engineering. Pub Date: 2011-04-11. DOI: 10.1109/ICDE.2011.5767856

Alexander Behm, Chen Li, M. Carey

Citations: 30

Abstract

An approximate string query asks for the strings in a collection that are similar to a given query string. Answering such queries is important in many applications, such as data cleaning and record linkage, where errors can occur in both the queries and the data. Many existing algorithms focus on in-memory indexes. In this paper we investigate how to answer such queries efficiently in a disk-based setting by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index and study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O cost of retrieving candidate matches against the I/O cost of accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.
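To make the terms "inverted index" and "candidate matches" concrete, here is a minimal in-memory sketch of the textbook q-gram approach to approximate string search under edit distance. It is an illustration only, not the paper's disk-based layout or its cost-based adaptive algorithm, neither of which the abstract describes in detail. The function names (`qgrams`, `build_index`, `search`), the choice of edit distance as the similarity measure, and the gram length `q = 2` are all assumptions made for this example.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return all overlapping substrings of length q (with repeats)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def build_index(strings, q=2):
    """Map each q-gram to the set of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index


def search(strings, index, query, k, q=2):
    """Return all strings within edit distance k of query, sorted."""
    grams = qgrams(query, q)
    # One edit destroys at most q grams of the query, so any answer must
    # still contain at least this many of the query's gram occurrences.
    threshold = len(grams) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything: check every string.
        candidates = range(len(strings))
    else:
        counts = defaultdict(int)
        for g in grams:
            for sid in index.get(g, ()):
                counts[sid] += 1
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    # Verify candidates: cheap length check first, then edit distance.
    return sorted(
        strings[sid] for sid in candidates
        if abs(len(strings[sid]) - len(query)) <= k
        and edit_distance(strings[sid], query) <= k
    )


strings = ["stanford", "stamford", "standard", "oxford", "stafford"]
index = build_index(strings)
print(search(strings, index, "stanford", k=1))
# ['stafford', 'stamford', 'stanford']
```

In this sketch both the inverted lists and the strings live in memory. In a disk-based setting, reading an inverted list and fetching a candidate string for verification each cost I/O, which is the trade-off the abstract says the authors' algorithm balances.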
