Two-Level Dictionary-Based Text Compression Scheme

2008 11th International Conference on Computer and Information Technology. Pub Date: 2008-12-01. DOI: 10.1109/ICCITECHN.2008.4803026

Z.K. Zia, D.F. Rahman, C.M. Rahman

Citations: 11

Abstract

This paper presents a new dictionary- and memory-based text compression technique, the two-level dictionary-based text compression scheme. Words in the original text file are replaced by codewords of length 2 or 3, using a dictionary of 73,680 frequently used English words. The most frequently used of these words receive 2-length codewords and the rest receive 3-length codewords, for better compression. The codewords are chosen so that the spaces between words in the original text can be removed altogether, which recovers a substantial amount of space. Another distinctive feature of the scheme is that it reclaims the unused bit in each character's ASCII representation, saving one byte for every 8 ASCII characters. Finally, an existing back-end compression algorithm compresses the resulting file. Used with gzip and bzip2, the scheme achieves a size reduction of about 75% (a compression ratio of 2.01 bits per input character).
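The bit-reclaiming step and the reported figures can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not give the bit layout, so the function names `pack7`, `unpack7`, and `reduction_from_bpc` and the MSB-first packing order are assumptions made for illustration. The sketch only shows that dropping the unused high bit of each ASCII character stores 8 characters in 7 bytes, and that 2.01 bits per input character corresponds to roughly a 75% reduction relative to 8-bit characters.

```python
def pack7(text: bytes) -> bytes:
    """Pack 7-bit ASCII bytes into a dense bit stream (8 chars -> 7 bytes).

    Assumption: bits are written MSB-first; the paper's layout may differ.
    """
    if any(b > 0x7F for b in text):
        raise ValueError("input must be 7-bit ASCII")
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently held in acc
    out = bytearray()
    for b in text:
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack7(packed: bytes, n_chars: int) -> bytes:
    """Inverse of pack7; n_chars is the original character count."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < n_chars:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)


def reduction_from_bpc(bpc: float, source_bits: float = 8.0) -> float:
    """Fractional size reduction implied by a bits-per-character figure."""
    return 1.0 - bpc / source_bits


sample = b"ABCDEFGH"
packed = pack7(sample)
print(len(sample), len(packed))          # 8 input bytes become 7
print(unpack7(packed, len(sample)))      # round-trips to the original
print(round(reduction_from_bpc(2.01), 4))
```

In this sketch the original character count has to be stored alongside the packed data, because the zero padding in the final byte could otherwise decode as an extra NUL character.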
