Two-Level Dictionary-Based Text Compression Scheme
Z.K. Zia, D.F. Rahman, C.M. Rahman
2008 11th International Conference on Computer and Information Technology, 2008-12-01
DOI: 10.1109/ICCITECHN.2008.4803026
Citations: 11
Abstract
This paper presents a new dictionary- and memory-based text compression technique, called a two-level dictionary-based text compression scheme. Words in a text file are transformed into codewords of length 2 or 3 using a dictionary of 73,680 frequently used English words. The most frequently used of these words are assigned 2-length codewords and the rest are assigned 3-length codewords, for better compression. The codewords are chosen in such a way that the spaces between words in the original text file can be removed altogether, recovering a substantial amount of space. Another distinctive feature of the scheme is that it recovers the unused bit in each character's ASCII representation, saving one byte per 8 ASCII characters. Finally, an existing back-end compression algorithm is applied to compress the file further. Using this strategy with gzip and bzip2, we achieve about a 75% reduction in size (a compression ratio of 2.01 bits per input character).
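The bit-recovery step and the reported figures can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not give the bit layout, so the MSB-first packing order, the zero padding of the final byte, and the names `pack_7bit`, `unpack_7bit`, and `size_reduction` are assumptions made for illustration. The size check also assumes an 8-bit-per-character baseline, under which 2.01 bits per character corresponds to a reduction of 1 - 2.01/8, or roughly 75%.

```python
def pack_7bit(data: bytes) -> bytes:
    """Pack 7-bit ASCII bytes into a dense bit stream (8 chars -> 7 bytes).

    Assumption: bits are written MSB-first and the last byte is zero-padded.
    """
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("input is not 7-bit ASCII")
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


def unpack_7bit(packed: bytes, count: int) -> bytes:
    """Inverse of pack_7bit; `count` is the original number of characters."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1  # drop bits already consumed
    return bytes(out)


def size_reduction(bits_per_char: float, baseline_bits: float = 8.0) -> float:
    """Fractional size reduction relative to an assumed 8-bit baseline."""
    return 1.0 - bits_per_char / baseline_bits


if __name__ == "__main__":
    text = b"ABCDEFGH"
    packed = pack_7bit(text)
    print(len(text), "->", len(packed), "bytes")
    print(unpack_7bit(packed, len(text)) == text)
    print(size_reduction(2.01))
```

Because packing is lossless only for code points below 128, the sketch rejects any byte with the high bit set. The decoder also needs the original character count, which a real file format would have to store or infer.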