Multi-lingual cascading text compressors for WWW

Proceedings International Conference on Information Technology: Coding and Computing (Cat. No.PR00540). Pub Date: 2000-03-27. DOI: 10.1109/ITCC.2000.844279

Chi-Hung Chi


Abstract

The global sharing and distribution of information on the Internet has created strong demand for efficient multilingual text compression in Web servers and proxies. Current text compressors, such as Huffman coding, Lempel-Ziv (LZ) variants, and LZ-Huffman cascades, perform poorly on such text because their character sampling size is mismatched to the data and because multilingual text draws on a large character set. Our previous research showed that re-adjusting the character sampling rate yields a better compression ratio. Here we investigate cascading LZ variants with Huffman coding for multilingual documents. We propose two basic approaches, one using static dictionaries and one using dynamic dictionaries, and suggest techniques for reducing the dictionary overhead. On our multilingual corpus, our adaptive cascading scheme outperforms gzip, the well-known cascading compressor, by about 20% on average.
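The sampling-size mismatch described in the abstract can be illustrated with a minimal sketch. It is not the paper's compressor: it only compares the ideal order-0 (Huffman-style) code length of the same text when it is sampled as 8-bit symbols versus 16-bit symbols. The sample string, the UTF-16 encoding, and the helper names `sample` and `order0_bits` are assumptions made for illustration, and the cost of transmitting the symbol table (the dictionary overhead the abstract mentions) is deliberately left out.

```python
import math
from collections import Counter


def sample(data: bytes, width: int) -> list:
    """Split data into fixed-width symbols of `width` bytes (hypothetical helper)."""
    return [data[i:i + width] for i in range(0, len(data), width)]


def order0_bits(symbols: list) -> float:
    """Bits needed by an ideal order-0 entropy coder for this symbol stream.

    A Huffman code comes within one bit per symbol of this bound.
    The symbol table itself is NOT counted here.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


# Assumed sample: a short Chinese phrase repeated, stored as 2 bytes per character.
text = "多语言文本压缩" * 200
data = text.encode("utf-16-be")

bits8 = order0_bits(sample(data, 1))   # byte-oriented sampling
bits16 = order0_bits(sample(data, 2))  # one symbol per character

print(f"8-bit sampling : {bits8:.0f} bits")
print(f"16-bit sampling: {bits16:.0f} bits")
```

With 16-bit sampling each symbol is a whole character, so the order-0 model never needs more bits than the byte-oriented one, which splits every character into two loosely related halves. The price is a larger symbol alphabet, which is where dictionary overhead becomes a concern in a real coder.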


