Generalized Word Equations: A New Approach to Data Compression
Authors: Michal Kutwin, Wojciech Plandowski, Artur Zaroda
Venue: 2019 Data Compression Conference (DCC)
Publication date: 2019-03-26
DOI: 10.1109/DCC.2019.00097 (https://doi.org/10.1109/DCC.2019.00097)
Citations: 1
Abstract
Let Σ be an alphabet. A generalized word equation (GWE for short) is a set of triples and pairs. A triple has the form (p, q, l), where p, q and l are positive integers. A pair has the form (a, i), where a ∈ Σ and i is a positive integer. A solution of a GWE e is any word w such that w[p..p + l − 1] = w[q..q + l − 1] for each triple (p, q, l) in e, and w[i] = a for each pair (a, i) in e. If e has exactly one shortest solution w, we say that e defines w. Observe that if e defines w, then the solution set of e is {ws : s ∈ Σ*}. The triples and pairs of an equation e are called constraints. If an equation e defines a word w, we say that e is a compressed representation of w. Let G be a GWE with m constraints that defines a word w. There is an algorithm that reconstructs w from G in O(m + |w|) worst-case time [1], so decompression is optimal. It is not difficult to prove that simple modifications of GWE generalize the LZ77, LZ78 and LZW algorithms. We consider a natural variant of GWE called pGWE and prove that, for a word w, it is slightly more efficient and more general than LZ77 applied to the reversed word w^R. Moreover, it can be proved that the GWE approach generalizes the bidirectional scheme. We compare GWE with straight-line programs (SLP for short) [2, 3] and prove that if an SLP for a word w has length n, then there is a GWE with n constraints that defines w. We are not aware of any reasonable simulation in the other direction. We propose a variant of GWE that compresses an input word w in O(|w|L^2) worst-case time, where L is the length of the longest repeating factor in w. This variant was tested on the files of the Canterbury Corpus. It gives better results than gzip on text files and slightly worse results on the other files. It is worth mentioning that gzip is the result of 20 years of work on LZ77, so comparing it with our approach is unfair. Our current best approach is significantly worse than bzip2, which is based on the Burrows-Wheeler transform.
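To make the definitions concrete, the following is a minimal sketch of how a word might be recovered from a GWE. It is not the O(m + |w|) algorithm of [1]; it is a naive union-find approach whose running time grows with the sum of the triple lengths. The function name `solve_gwe` and the input layout (lists of Python tuples, 1-based positions) are assumptions made for illustration only. The sketch also assumes that Σ has at least two letters, so that an unconstrained position means the shortest solution is not unique.

```python
def solve_gwe(triples, pairs):
    """Return the word defined by a GWE, or None if the constraints are
    contradictory or leave some position undetermined.

    triples: iterable of (p, q, l), meaning w[p..p+l-1] = w[q..q+l-1]
    pairs:   iterable of (a, i), meaning w[i] = a
    Positions are 1-based, as in the abstract.
    """
    triples = list(triples)
    pairs = list(pairs)

    # Every solution must reach the largest position mentioned by any
    # constraint, so a shortest solution has exactly this length.
    n = 0
    for p, q, l in triples:
        n = max(n, p + l - 1, q + l - 1)
    for _, i in pairs:
        n = max(n, i)

    parent = list(range(n + 1))  # index 0 is unused

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # A triple forces positions p+k and q+k to carry the same letter.
    for p, q, l in triples:
        for k in range(l):
            a, b = find(p + k), find(q + k)
            if a != b:
                parent[a] = b

    # A pair fixes the letter of a whole equivalence class.
    letter = {}
    for a, i in pairs:
        root = find(i)
        if letter.setdefault(root, a) != a:
            return None  # two different letters forced on one class

    out = []
    for i in range(1, n + 1):
        root = find(i)
        if root not in letter:
            return None  # position i is free, so no unique shortest solution
        out.append(letter[root])
    return "".join(out)


if __name__ == "__main__":
    # Two pairs and one self-overlapping triple define a six-letter word.
    print(solve_gwe([(1, 3, 4)], [("a", 1), ("b", 2)]))
```

In the usage example, the single triple (1, 3, 4) overlaps itself in much the same way as an LZ77 copy whose source runs into its own output, which gives some intuition for the claim that simple modifications of GWE generalize LZ77.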