Title: A New Text Compression Algorithm Based on Index Permutation and Suffix Coding
Authors: Emre Erkan; Erdoğan Aldemir; Şehmus Fidan; Hidayet Oğraş
Journal: IEEE Canadian Journal of Electrical and Computer Engineering, vol. 48, no. 3, pp. 268-280 (impact factor 1.9; JCR Q3, Computer Science, Hardware & Architecture)
Published: 2025-08-11
DOI: 10.1109/ICJECE.2025.3587644
URL: https://ieeexplore.ieee.org/document/11122108/
Citations: 0
Abstract
The rapid generation and use of text data, driven by the proliferation of the Internet of Things (IoT) and large language models, has intensified the need for efficient lossless text compression. To address this, we introduce HEES23, a novel lossless compression algorithm designed specifically for English text. HEES23 employs a suffix coding scheme that incorporates new symbol representations and a fixed, language-optimized table to maximize compression efficiency. In addition, adaptive entropy-reduction techniques combined with block sorting expose the significant empirical entropy and redundancy present in raw textual data. A key feature of HEES23 is its recursive mapping mechanism for index encoding and symbol extraction, which iteratively reduces redundancy while preserving data integrity. The algorithm was applied experimentally to diverse human-generated text datasets and benchmarked against established standards. Results show that HEES23 achieves an average compression ratio exceeding 30% for inputs as small as 0.1 kB, outperforming Deflate, Brotli, LZ77, and bzip2, which at that size either expand the data (negative compression) or achieve only about 10%. On larger and more complex datasets, HEES23 maintains strong performance, with compression rates between 53% and 64%, underscoring its suitability for IoT applications that require long-range, low-bandwidth communication.
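The "negative compression" that the abstract reports for general-purpose codecs on very small inputs can be observed with a minimal Python sketch. It does not implement HEES23. It only measures two of the named baselines through the standard-library `zlib` (Deflate) and `bz2` (bzip2) modules. The helper names `space_saving` and `benchmark`, the sample sentence, and the reading of "compression ratio" as the percentage of space saved are assumptions made for illustration, not details taken from the paper.

```python
import bz2
import zlib


def space_saving(original: bytes, compressed: bytes) -> float:
    """Percentage of space saved; a negative value means the output grew."""
    # Assumption: "compression ratio" in the abstract is read as space saving.
    return 100.0 * (1.0 - len(compressed) / len(original))


def benchmark(text: str) -> dict[str, float]:
    """Compress non-empty `text` with two baseline codecs and report space saving."""
    data = text.encode("utf-8")
    return {
        "zlib (Deflate)": space_saving(data, zlib.compress(data, 9)),
        "bz2 (bzip2)": space_saving(data, bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    # Hypothetical sample of roughly 0.1 kB of English text.
    sample = (
        "Sensor node 7 reports a stable temperature and low battery; "
        "next update is scheduled in ten minutes."
    )
    print(f"input size: {len(sample.encode('utf-8'))} bytes")
    for name, saving in benchmark(sample).items():
        print(f"{name}: {saving:+.1f}%")
```

Each codec adds fixed header and checksum bytes, so on inputs of this size the overhead can cancel or exceed the savings. A fixed, language-optimized table of the kind the abstract describes avoids sending such per-message model information, which is one plausible reason a specialized scheme would do better on short messages.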