Hierarchical Relative Lempel-Ziv Compression

Bulletin of the Society of Sea Water Science, Japan Pub Date : 2022-08-24 DOI:10.48550/arXiv.2208.11371

P. Bille, I. L. Gørtz, S. Puglisi, Simon R. Tarnow

Volume 24 1, pages 18:1-18:16

Citations: 1

Abstract

Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string $S$ is compressed relative to a second string $R$ (called the reference) by parsing $S$ into a sequence of substrings that occur in $R$. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. Now that DNA sequencing is cheap, such data sets have become extremely abundant and are growing rapidly. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we form a rooted tree (or hierarchy) on the strings and then compress each string using RLZ with its parent as the reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome data sets, with negligible effect on decompression time compared to the standard single-reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a complete weighted digraph on the strings, where the weight of each edge is the number of phrases in the RLZ parsing of the destination vertex relative to the source vertex. We further show that, instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy without adversely affecting compression performance.
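The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the longest-match search is a naive quadratic scan (a practical implementation would index the reference, for example with a suffix array), and encoding characters absent from the reference as single-character literals is an assumption made here so that the parse is always defined. The sketch assumes the hierarchy is already given and does not cover its construction via an arborescence or locality-sensitive hashing.

```python
from collections import deque


def rlz_parse(s, ref):
    """Greedy RLZ parse of s relative to ref.

    Each phrase is either (pos, length), meaning ref[pos:pos + length],
    or a one-character string for a character that never occurs in ref
    (the literal case is an assumption of this sketch).
    """
    phrases = []
    i = 0
    while i < len(s):
        best_pos, best_len = -1, 0
        # Naive scan for the longest prefix of s[i:] that occurs in ref.
        for p in range(len(ref)):
            l = 0
            while i + l < len(s) and p + l < len(ref) and s[i + l] == ref[p + l]:
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        if best_len == 0:
            phrases.append(s[i])
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(phrases, ref):
    """Rebuild a string from its RLZ phrases and the reference."""
    out = []
    for ph in phrases:
        if isinstance(ph, str):
            out.append(ph)
        else:
            pos, length = ph
            out.append(ref[pos:pos + length])
    return "".join(out)


def compress_hierarchy(strings, parent, root):
    """Parse every non-root string relative to its parent in the tree."""
    return {
        name: rlz_parse(text, strings[parent[name]])
        for name, text in strings.items()
        if name != root
    }


def decompress_hierarchy(root, root_text, parent, parsed):
    """Decode all strings by a BFS from the root, parents before children."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    decoded = {root: root_text}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            decoded[child] = rlz_decode(parsed[child], decoded[node])
            queue.append(child)
    return decoded


# Toy collection: only the root "A" is kept in plain text.
strings = {"A": "ACGTACGTTAGC", "B": "ACGTACGATAGC", "C": "ACGAACGATAGC"}
parent = {"B": "A", "C": "B"}
parsed = compress_hierarchy(strings, parent, "A")
restored = decompress_hierarchy("A", strings["A"], parent, parsed)
assert restored == strings
```

The BFS order matters because a child can only be decoded once its parent's text is available. Under this sketch, `len(parsed[name])` is the phrase count that would serve as the weight of the edge from `parent[name]` to `name` in the digraph described above.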

