A New Text Compression Algorithm Based on Index Permutation and Suffix Coding

Impact Factor: 1.9 · JCR Q3 · Computer Science, Hardware & Architecture

IEEE Canadian Journal of Electrical and Computer Engineering · Vol. 48, No. 3, pp. 268-280 · Pub Date: 2025-08-11 · DOI: 10.1109/ICJECE.2025.3587644

Emre Erkan; Erdoğan Aldemir; Şehmus Fidan; Hidayet Oğraş


Citations: 0


A New Text Compression Algorithm Based on Index Permutation and Suffix Coding

The rapid generation and use of text data, driven by the proliferation of the Internet of Things (IoT) and large language models, has intensified the need for efficient lossless text compression. To address this, we introduce HEES23, a novel lossless compression algorithm designed specifically for English text. HEES23 employs a suffix coding scheme that incorporates new symbol representations and a fixed, language-optimized table to maximize compression efficiency. In addition, adaptive entropy reduction techniques combined with block sorting expose the significant empirical entropy and redundancy present in raw textual data. A key feature of HEES23 is its recursive mapping mechanism for index encoding and symbol extraction, which iteratively reduces redundancy while preserving data integrity. The algorithm was applied experimentally to diverse human-generated text datasets and benchmarked against established standards. Results show that HEES23 achieves an average compression ratio exceeding 30% for inputs as small as 0.1 kB, outperforming methods such as Deflate, Brotli, LZ77, and bzip2, which either yield negative compression (output larger than the input) or offer limited efficiency of around 10%. Furthermore, HEES23 maintains strong performance on larger and more complex datasets, achieving compression rates between 53% and 64%, which underscores its suitability for IoT applications that require long-range, low-bandwidth communication.
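
The small-input behavior of general-purpose compressors mentioned in the abstract can be explored with a minimal sketch like the one below. It does not implement HEES23; it only measures two of the named baselines using Python's standard `zlib` (Deflate inside a zlib wrapper) and `bz2` modules. The helper names `space_saving` and `baseline_savings` are hypothetical, and treating the abstract's percentages as space saving (1 minus compressed size over original size) is an assumption, since the abstract does not define the metric.

```python
from __future__ import annotations

import bz2
import zlib


def space_saving(original: bytes, compressed: bytes) -> float:
    """Fraction of bytes saved; negative when the output is larger than the input.

    Assumption: this is the metric behind the abstract's percentages.
    """
    return 1.0 - len(compressed) / len(original)


def baseline_savings(text: str) -> dict[str, float]:
    """Space saving of two standard-library baselines on UTF-8 encoded text."""
    data = text.encode("utf-8")
    return {
        # zlib wraps Deflate in a small header and checksum, so results
        # differ slightly from a raw Deflate stream.
        "deflate (zlib)": space_saving(data, zlib.compress(data, 9)),
        "bzip2": space_saving(data, bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    # Hypothetical sample of roughly 0.1 kB, chosen only for illustration.
    sample = (
        "Sensor node 7 reports a temperature of 21 degrees and "
        "humidity of 40 percent at noon today."
    )
    for name, saving in baseline_savings(sample).items():
        print(f"{name}: {saving:+.1%}")
```

Fixed container overhead (headers, checksums, code tables) dominates at this scale, which is why very short inputs can come out larger than they went in. Exact figures depend on the sample text and the library version, so none are quoted here.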


Source journal

IEEE Canadian Journal of Electrical and Computer Engineering

CiteScore

3.70

Self-citation rate

0.00%

Articles published