Efficient Compression Scheme for Large Natural Text Using Zipf Distribution

2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT). Pub Date: 2019-05-01. DOI: 10.1109/ICASERT.2019.8934651

Md. Ashiq Mahmood, K. Hasan


Citations: 2

Abstract

Data compression is the process of modifying, encoding, or converting the bit structure of data so that it occupies less space. Character encoding, which represents each character under some encoding scheme, is related to data compression. Encoding is the process of putting a sequence of characters into a specific format for efficient transmission or storage. Data compression has a wide range of applications, including data communication, data storage, and database development. In this paper, we propose a new and efficient compression algorithm for large natural-language datasets, called 5-Bit Compression (5BC), in which every character is encoded in 5 bits. The algorithm encodes any English or Bangla character in 5 bits using a table lookup. The lookup table is constructed using the Zipf distribution. The Zipf distribution is a discrete distribution of commonly used characters in different languages. 8-bit characters are converted to 5 bits by partitioning the characters into 7 sets and placing them in a single table. Each character's location in the table is then uniquely encoded in 5 bits. With 5BC, the reported compression figure is more than 60% relative to the original text. The decompression algorithm that recovers the original data is also described. After 5BC produces its output string, LZW and Huffman coding compress it further. Our experimental results demonstrate promising performance.
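The abstract does not give the actual 5BC lookup table or say how the 7 sets are addressed, so the following Python code is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical layout: characters are ranked by frequency in a sample, split into 7 sets of 25, codes 0 to 24 select a slot in the active set, and codes 25 to 31 switch the active set. All names (`build_table`, `encode`, `decode`, `pack`, `unpack`, `SET_SIZE`, `NUM_SETS`) are invented for this illustration.

```python
from collections import Counter

# Assumed layout (not from the paper): 7 sets of 25 characters each.
SET_SIZE = 25   # codes 0..24 address a slot inside the active set
NUM_SETS = 7    # codes 25..31 switch the active set


def build_table(sample: str) -> list[str]:
    """Rank characters by descending frequency; ties broken by code point."""
    counts = Counter(sample)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    if len(ranked) > SET_SIZE * NUM_SETS:
        raise ValueError("alphabet too large for this toy layout")
    return ranked


def encode(text: str, table: list[str]) -> list[int]:
    """Map each character to a 5-bit code, emitting a switch code on set change."""
    index = {ch: i for i, ch in enumerate(table)}
    codes, active = [], 0
    for ch in text:
        set_id, slot = divmod(index[ch], SET_SIZE)
        if set_id != active:
            codes.append(SET_SIZE + set_id)  # switch code, 25..31
            active = set_id
        codes.append(slot)
    return codes


def decode(codes: list[int], table: list[str]) -> str:
    """Invert encode() by tracking the active set."""
    out, active = [], 0
    for code in codes:
        if code >= SET_SIZE:
            active = code - SET_SIZE
        else:
            out.append(table[active * SET_SIZE + code])
    return "".join(out)


def pack(codes: list[int]) -> bytes:
    """Pack 5-bit codes into bytes, most significant bit first, zero-padded."""
    bits, nbits, out = 0, 0, bytearray()
    for code in codes:
        bits = (bits << 5) | code
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack(data: bytes, n_codes: int) -> list[int]:
    """Read back n_codes 5-bit codes; n_codes must be stored separately."""
    bits, nbits, codes = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5 and len(codes) < n_codes:
            nbits -= 5
            codes.append((bits >> nbits) & 0x1F)
        bits &= (1 << nbits) - 1
    return codes


if __name__ == "__main__":
    text = "the rain in spain stays mainly in the plain"
    table = build_table(text)
    codes = encode(text, table)
    packed = pack(codes)
    print(len(text), "characters ->", len(packed), "bytes")
    print(decode(unpack(packed, len(codes)), table) == text)
```

In this toy layout, a text that never leaves set 0 needs 5 bits per character instead of 8, that is, 62.5% of an 8-bit encoding, and every set switch costs an extra 5 bits. Ranking by frequency keeps the most common characters in set 0, which is one plausible reason to build the table from a Zipf-like distribution. The packed bytes could then be passed to a general-purpose LZW or Huffman coder as a second stage.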

