Design consideration for multi-lingual cascading text compressors

Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096) Pub Date: 1999-03-29 DOI: 10.1109/DCC.1999.785677

Chi-Hung Chi, IV YanZhang



Abstract

Summary form only given. We study the cascading of LZ variants into Huffman coding for multilingual documents. Two models are proposed: a static model and an adaptive (dynamic) model. The static model uses the dictionary generated by the LZW algorithm in Chinese dictionary-based Huffman compression to achieve better performance. The dynamic model extends the static cascading model: as phrases are inserted into the dictionary, their frequency counts are updated, so that a dynamic Huffman tree with variable-length output tokens is obtained. We propose a new method to capture the "LZW dictionary" by picking up the dictionary entries during decompression. The general idea is to add delimiters during decompression so that the decompressed file is segmented into phrases that reflect how the LZW compressor used its dictionary phrases to encode the source. The adaptive cascading model can be thought of as an extension of Chinese LZW compression. Because the size of the header is an important performance bottleneck in the static cascading model, we propose the adaptive cascading model to address this issue. The LZW compressor now outputs not a fixed-length token but a variable-length Huffman code from the Huffman tree. Such a compressor is expected to achieve very good compression performance. In our adaptive cascading model we choose LZW instead of LZSS because the LZW algorithm preserves more information than the LZSS algorithm does. This characteristic is found to be very useful in helping Chinese compressors attain better performance.
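The abstract gives no implementation details, so the following is only an illustrative sketch of the static cascade as described: a textbook LZW decoder records the phrase each code expands to (the "delimiter" idea), and the phrase frequencies then feed an ordinary Huffman construction. All function names, the `|` delimiter, the sample string, and the choice to seed the dictionary from the input's own characters are assumptions made for this example, not the authors' design.

```python
import heapq
from collections import Counter


def lzw_compress(text):
    """Textbook LZW: emit one integer code per longest dictionary match."""
    # Assumption: seed the dictionary with the distinct characters of the input.
    alphabet = sorted(set(text))
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes, phrase = [], ""
    for ch in text:
        if phrase + ch in table:
            phrase += ch
        else:
            codes.append(table[phrase])
            table[phrase + ch] = len(table)
            phrase = ch
    if phrase:
        codes.append(table[phrase])
    return alphabet, codes


def lzw_decompress_segmented(alphabet, codes):
    """Decode LZW codes and return the list of phrases, one per code."""
    table = {i: ch for i, ch in enumerate(alphabet)}
    phrases, prev = [], None
    for code in codes:
        if code in table:
            entry = table[code]
        else:
            # Code refers to the entry being built right now (the KwKwK case).
            entry = prev + prev[0]
        if prev is not None:
            table[len(table)] = prev + entry[0]
        phrases.append(entry)
        prev = entry
    return phrases


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # The integer index breaks ties so the symbol lists are never compared.
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # every merge adds one bit to each member
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


if __name__ == "__main__":
    sample = "压缩文本压缩文本压缩文本"  # hypothetical input
    alphabet, codes = lzw_compress(sample)
    phrases = lzw_decompress_segmented(alphabet, codes)
    print("|".join(phrases))  # decompressed text with phrase delimiters
    print(huffman_code_lengths(Counter(phrases)))
```

In this sketch the phrase-to-code table would have to be shipped in a header, which is the overhead the abstract identifies as the static model's bottleneck; the adaptive model is described as avoiding it by updating frequency counts while coding.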


