精确子串匹配的空间经济部分克索引

Proceedings of the 18th ACM conference on Information and knowledge management Pub Date : 2009-11-02 DOI:10.1145/1645953.1645992

N. Tang, Lefteris Sidirourgos, P. Boncz

{"title":"精确子串匹配的空间经济部分克索引","authors":"N. Tang, Lefteris Sidirourgos, P. Boncz","doi":"10.1145/1645953.1645992","DOIUrl":null,"url":null,"abstract":"Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Space-economical partial gram indices for exact substring matching\",\"authors\":\"N. Tang, Lefteris Sidirourgos, P. Boncz\",\"doi\":\"10.1145/1645953.1645992\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.\",\"PeriodicalId\":286251,\"journal\":{\"name\":\"Proceedings of the 18th ACM conference on Information and knowledge management\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-11-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 18th ACM conference on Information and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1645953.1645992\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 18th ACM conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1645953.1645992","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

大型数据集合上的精确子字符串匹配查询可以使用q-gram索引来回答，它为每个出现的q-byte模式存储一个包含所有出现位置的(有序)发布列表。众所周知，这样的gram索引可以提供快速的查询响应时间，并且即使在基于磁盘的庞大数据集上也可以快速创建索引。它们的主要缺点是相对较大的存储空间，这是原始数据大小的常数倍(通常>2)，即使使用压缩也是如此。在这项工作中，我们研究了在减少存储空间的同时，节省可扩展的创建时间和有效的精确子字符串查询属性的方法。为此，我们首先提出了一个基于将省略索引q-g问题简化为集合覆盖问题的部分克指数。虽然这种方法成功地减少了索引的大小，但它在查询时产生误报，从而降低了效率。然后，我们通过在一组频率调优的签名中拆分频繁grams的发布列表来提高部分grams的准确性，这些签名考虑了gram周围的字节。由此产生的q -gram方案在大型集合(高达426GB)上进行了测试，结果显示，由于减少了大小和访问成本，它几乎可以实现1:1的数据:索引大小和查询性能，甚至比普通的gram方法更快。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Space-economical partial gram indices for exact substring matching

Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 18th ACM conference on Information and knowledge management

自引率

0.00%

发文量