ETAOSD: Static dictionary-based transformation method for text compression

Fadlelmoula Mohamed Baloul, Mohsin Hassan Abdullah, E. A. Babikir
Published in: 2013 International Conference on Computing, Electrical and Electronic Engineering (ICCEEE), 2013-10-17
DOI: 10.1109/ICCEEE.2013.6633967 (https://doi.org/10.1109/ICCEEE.2013.6633967)
Citations: 0

Abstract

This paper presents a new static dictionary-based text transformation algorithm that increases the compression ratio achieved by standard compression tools. The basic idea is to define a pattern for each word in a static dictionary by replacing all or most of the word's characters with the character that occurs most frequently in text files. The algorithm transforms any text file into another encrypted file of almost the same size as the original but with different statistical properties. The transformation method was designed, implemented, and tested on the Gutenberg corpus. The transformed output improved the results of common compression methods and tools such as arithmetic coding, Huffman coding, Bzip2, Gzip, and WinZip, to varying degrees. Compression performance improved for all of these tools, especially when the patterns of the transformed words were first passed through an inexpensive run-length encoding (RLE) step. With Bzip2, the output files reached a compression ratio of about 76.75% with an average code length of 1.88. The result is promising and could be improved further by applying a dynamic dictionary-based text transformation technique.
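The abstract does not spell out how the patterns are built, so the following Python is only a minimal sketch of the general idea, not the authors' ETAOSD scheme. It assumes that the filler character is `e`, that the i-th dictionary word of a given length receives the base-26 representation of i written with a frequency-ordered alphabet (so low indices are mostly `e`), and that words missing from the dictionary are marked with a hypothetical escape character `~`. The names `ALPHABET`, `ESCAPE`, `make_code`, `build_tables`, `transform`, and `restore` are all illustrative, and the RLE step mentioned above is left out.

```python
import re
from collections import defaultdict

# Assumption: letters in rough descending English frequency; "digit 0" is the filler 'e'.
ALPHABET = "etaoinshrdlcumwfgypbvkjxqz"
# Assumption: hypothetical marker for words that are not in the dictionary.
ESCAPE = "~"


def make_code(index, length):
    """Write index in base 26 using ALPHABET, left-padded with 'e' to the given length."""
    digits = []
    for _ in range(length):
        index, digit = divmod(index, 26)
        digits.append(ALPHABET[digit])
    if index:
        raise ValueError("too many dictionary words of this length")
    return "".join(reversed(digits))


def build_tables(dictionary):
    """Map each word to a same-length pattern; earlier words get more filler characters."""
    seen_per_length = defaultdict(int)
    encode_map = {}
    for word in dictionary:
        if not re.fullmatch(r"[a-z]+", word):
            raise ValueError("this sketch only handles lowercase a-z words")
        if word in encode_map:
            continue  # ignore duplicate dictionary entries
        encode_map[word] = make_code(seen_per_length[len(word)], len(word))
        seen_per_length[len(word)] += 1
    decode_map = {code: word for word, code in encode_map.items()}
    return encode_map, decode_map


def transform(text, encode_map):
    """Replace dictionary words by their patterns; escape unknown lowercase words."""
    if ESCAPE in text:
        raise ValueError("input must not contain the escape character")
    return re.sub(
        r"[a-z]+",
        lambda m: encode_map.get(m.group(0), ESCAPE + m.group(0)),
        text,
    )


def restore(text, decode_map):
    """Invert transform(): strip escapes and map patterns back to words."""
    def undo(match):
        token = match.group(0)
        if token.startswith(ESCAPE):
            return token[1:]
        return decode_map[token]

    return re.sub(r"~?[a-z]+", undo, text)


# Toy dictionary, for illustration only.
words = ["the", "and", "for", "of", "to", "compression", "text"]
encode_map, decode_map = build_tables(words)
sample = "the compression of text and the zip"
transformed = transform(sample, encode_map)
print(transformed)  # eee eeeeeeeeeee ee eeee eet eee ~zip
print(restore(transformed, decode_map) == sample)  # True
```

In this sketch the transformed text is dominated by one symbol and grows by only one character per unknown word, which mirrors the abstract's claim of a nearly unchanged file size with different statistics. Whether a given compressor benefits would have to be measured on real data; the toy input here is far too small to say anything about compression ratios.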