Effective Construction of Relative Lempel-Ziv Dictionaries

Kewen Liao, M. Petri, Alistair Moffat, Anthony Wirth
{"title":"Effective Construction of Relative Lempel-Ziv Dictionaries","authors":"Kewen Liao, M. Petri, Alistair Moffat, Anthony Wirth","doi":"10.1145/2872427.2883042","DOIUrl":null,"url":null,"abstract":"Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each \"epoch\" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th International Conference on World Wide Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2872427.2883042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23

Abstract

Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data and to allow storage costs to be minimized, while still providing some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel-Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary, is competitive relative to adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed from concatenations of substrings regularly sampled from the underlying document collection, then pruned in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics, based on effective construction, rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, with a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed from the concatenation of these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with the best existing pruning method, CARE, our scheme has a similar construction time, but achieves better compression effectiveness. Over several multi-gigabyte document collections, there are relative gains of up to 27%.
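The abstract only outlines the pipeline, so the following Python is a minimal, hypothetical sketch of the construction idea as described, not the authors' implementation: k-mer frequencies are estimated from a sampled subset of positions (a plain Bernoulli position sampler stands in here for the paper's reservoir sampler), and then one segment per fixed-size epoch is chosen greedily by the total weight of the not-yet-covered k-mers it contains, in the spirit of greedy Set Cover. The parameter choices (k, sample_rate, epoch_len, segment_len) and all function names are illustrative assumptions.

```python
import random
from collections import Counter


def sample_kmer_counts(text: str, k: int, sample_rate: float = 0.05,
                       seed: int = 0) -> Counter:
    """Estimate k-mer frequencies from a random sample of positions.
    (Stand-in for the reservoir sampler mentioned in the abstract.)"""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for i in range(len(text) - k + 1):
        if rng.random() < sample_rate:
            counts[text[i:i + k]] += 1
    return counts


def segment_score(segment: str, k: int, weights: Counter,
                  covered: set) -> float:
    """Sum the weights of distinct k-mers in the segment that the
    dictionary does not yet cover (a Set-Cover-style marginal gain)."""
    seen: set = set()
    score = 0.0
    for i in range(len(segment) - k + 1):
        kmer = segment[i:i + k]
        if kmer not in covered and kmer not in seen:
            score += weights.get(kmer, 0)
            seen.add(kmer)
    return score


def build_dictionary(text: str, k: int = 8, epoch_len: int = 1 << 20,
                     segment_len: int = 1 << 14) -> str:
    """Concatenate one highest-scoring segment per epoch into a dictionary."""
    weights = sample_kmer_counts(text, k)
    covered: set = set()
    parts = []
    for e in range(0, len(text), epoch_len):
        epoch = text[e:e + epoch_len]
        # Greedily pick the aligned segment with the best coverage score.
        best_seg, best_score = "", -1.0
        for s in range(0, len(epoch), segment_len):
            seg = epoch[s:s + segment_len]
            sc = segment_score(seg, k, weights, covered)
            if sc > best_score:
                best_seg, best_score = seg, sc
        parts.append(best_seg)
        covered.update(best_seg[i:i + k]
                       for i in range(len(best_seg) - k + 1))
    return "".join(parts)
```

In this sketch, once a segment is admitted, its k-mers are marked as covered so that later epochs are scored only on material the dictionary does not already contain, which is what makes the selection greedy rather than purely frequency-driven.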