Two-Level Dictionary-Based Text Compression Scheme

2008 11th International Conference on Computer and Information Technology, Pub Date: 2008-12-01, DOI: 10.1109/ICCITECHN.2008.4803026

Z.K. Zia, D.F. Rahman, C.M. Rahman


Citations: 11

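Abstract

This paper presents a new dictionary- and memory-based text compression technique, called the two-level dictionary-based text compression scheme. Words in a text file are replaced by codewords of length 2 and 3, using a dictionary of 73,680 frequently used English words. The most frequently used of these words receive 2-length codewords and the rest receive 3-length codewords, which improves compression. The codewords are chosen so that the spaces between words in the original text file can be removed altogether, recovering a substantial amount of space. Another distinctive feature of the scheme is that it reclaims the unused bit in each character's ASCII representation, saving one byte for every 8 ASCII characters. Finally, an existing back-end compression algorithm is applied to compress the transformed file. With gzip and bzip2 as back ends, the scheme reduces file size by about 75% (2.01 bits per input character).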
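The two-level codeword idea can be illustrated with a toy sketch. The abstract does not specify the codeword alphabet, so everything below is an assumption made for illustration: the 52-letter alphabet, the rule that a lowercase lead character marks a 2-character codeword and an uppercase lead character marks a 3-character one, and the hypothetical helper names `build_codebook`, `encode`, and `decode`. The point is only that when the lead character reveals the codeword length, the encoded stream can be parsed without any spaces between words.

```python
import string

# Hypothetical codeword alphabet; the paper's real alphabet is not given in the abstract.
ALPHABET = string.ascii_letters
LEVEL1_LEAD = string.ascii_lowercase  # lead character of 2-character codewords
LEVEL2_LEAD = string.ascii_uppercase  # lead character of 3-character codewords


def build_codebook(words_by_frequency, level1_size):
    """Give 2-character codes to the first level1_size words, 3-character codes to the rest.

    In this toy layout level1_size can be at most 26 * 52 = 1352.
    """
    level1 = (a + b for a in LEVEL1_LEAD for b in ALPHABET)
    level2 = (a + b + c for a in LEVEL2_LEAD for b in ALPHABET for c in ALPHABET)
    codebook = {}
    for rank, word in enumerate(words_by_frequency):
        codebook[word] = next(level1 if rank < level1_size else level2)
    return codebook


def encode(text, codebook):
    """Replace each space-separated word by its codeword and drop the spaces."""
    return "".join(codebook[word] for word in text.split(" "))


def decode(encoded, codebook):
    """Recover the words; the lead character tells how long each codeword is."""
    reverse = {code: word for word, code in codebook.items()}
    words = []
    i = 0
    while i < len(encoded):
        length = 2 if encoded[i] in LEVEL1_LEAD else 3
        words.append(reverse[encoded[i:i + length]])
        i += length
    return " ".join(words)


# Toy dictionary, ordered from most to least frequent (made-up ranking).
vocabulary = ["the", "of", "and", "compression", "dictionary"]
codebook = build_codebook(vocabulary, level1_size=3)
text = "the compression of the dictionary"
encoded = encode(text, codebook)
restored = decode(encoded, codebook)
```

This sketch raises `KeyError` for words outside the dictionary and handles only single spaces; a real scheme would need an escape mechanism for unknown words, punctuation, and other whitespace, none of which the abstract describes.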


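The saving of one byte per 8 ASCII characters follows from ASCII using only 7 of the 8 bits in a byte: 8 characters carry 8 × 7 = 56 bits, which fit in 7 bytes. A minimal sketch of such packing is shown below. The bit order, the zero padding of the last byte, and the function names `pack7` and `unpack7` are assumptions for illustration, not the layout used in the paper.

```python
def pack7(data: bytes) -> bytes:
    """Pack 7-bit ASCII bytes into a dense bit stream (8 characters -> 7 bytes)."""
    if any(b > 0x7F for b in data):
        raise ValueError("input must be 7-bit ASCII")
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for b in data:
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        # Pad the final partial byte with zero bits on the right.
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack7(packed: bytes, count: int) -> bytes:
    """Recover `count` 7-bit characters from a stream produced by pack7."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)


sample = b"compress"             # 8 ASCII characters
packed = pack7(sample)           # 7 bytes
roundtrip = unpack7(packed, len(sample))
```

The character count is passed to `unpack7` explicitly because the padding bits in the last byte could otherwise be read as an extra character; a real file format would have to store that count or an end marker somewhere.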
