Multi-lingual cascading text compressors for WWW
Chi-Hung Chi
Proceedings International Conference on Information Technology: Coding and Computing (Cat. No.PR00540), 2000-03-27
DOI: 10.1109/ITCC.2000.844279 (https://doi.org/10.1109/ITCC.2000.844279)
Citation count: 0
Abstract
Global sharing and distribution of information on the Internet have created strong demand for efficient multilingual text compression in Web servers and proxy implementations. Current text compressors such as Huffman coding, Lempel-Ziv (LZ) variants, and LZ-Huffman cascades perform inefficiently on such text for two reasons: their character sampling size is mismatched to the text, and multilingual text draws on a large character set. Our previous research showed that re-adjusting the character sampling rate yields a better compression ratio. Here we investigate cascading LZ variants into Huffman coding for multilingual documents. We propose two basic approaches, based on static and dynamic dictionaries, and suggest techniques for reducing the dictionary overhead. On our multilingual corpus, our adaptive cascading scheme outperforms the well-known cascading compressor gzip by about 20% on average.
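The effect of character sampling size can be illustrated with a minimal sketch. It is not the paper's scheme: it covers only a Huffman stage, it ignores the code-table (dictionary) overhead that the abstract identifies as a cost, and it assumes UTF-16 big-endian input, so each character occupies two bytes. The helper names `split_symbols` and `huffman_payload_bits` are hypothetical.

```python
import heapq
from collections import Counter


def split_symbols(data: bytes, width: int) -> list[bytes]:
    """Cut a byte string into fixed-width symbols (the sampling size)."""
    if width < 1 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of the symbol width")
    return [data[i:i + width] for i in range(0, len(data), width)]


def huffman_payload_bits(symbols: list[bytes]) -> int:
    """Return the Huffman-coded payload size in bits, excluding the code table.

    The sum of the weights of all merged nodes in the Huffman tree equals
    the total number of bits needed to encode the input.
    """
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate alphabet: assume one bit per symbol.
        return len(symbols)
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


# Assumed sample: two CJK characters repeated, encoded as UTF-16 big-endian.
data = "中文中文".encode("utf-16-be")

bits_by_byte = huffman_payload_bits(split_symbols(data, 1))  # 8-bit sampling
bits_by_char = huffman_payload_bits(split_symbols(data, 2))  # 16-bit sampling
print(bits_by_byte, bits_by_char)
```

On this toy input, sampling at the character width gives a shorter payload than byte-wise sampling, because each two-byte character becomes a single symbol. A wider sampling size also enlarges the alphabet and therefore the code table, which this sketch leaves out of the count.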