Natural Language Compression per Blocks

2011 First International Conference on Data Compression, Communications and Processing. Pub Date: 2011-06-21. DOI: 10.1109/CCP.2011.25

P. Procházka, J. Holub


Citations: 3

Abstract


Natural Language Compression per Blocks

We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC compresses per block: the input is divided into several blocks, and each block is compressed separately according to its own statistical model. To avoid redundancy, the final vocabulary file is stored as the sequence of changes in the model between each pair of consecutive blocks. STBDC belongs to the family of Dense codes and keeps all of their attractive properties, including very high compression and decompression speed and an acceptable compression ratio of around 32% on natural language text. STBDC also provides properties useful in digital libraries and other textual databases. The method allows direct searching of the compressed text, and the vocabulary can be used as a block index. STBDC suits limited bandwidth in a client/server architecture, since it can send single compressed blocks together with only the corresponding part of the vocabulary. Finally, STBDC enables various approaches to updating and extending the compressed text.
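The idea of storing the vocabulary as changes between consecutive block models can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fixed word-count block size, and the choice to track only which words enter or leave the vocabulary (rather than their ranks or codewords) are all assumptions made for illustration.

```python
from collections import Counter


def split_blocks(words, block_size):
    """Cut the word sequence into consecutive blocks (block_size is an assumed parameter)."""
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def block_model(block):
    """Rank the block's distinct words by descending frequency, ties broken alphabetically."""
    counts = Counter(block)
    return sorted(counts, key=lambda w: (-counts[w], w))


def vocabulary_deltas(models):
    """Describe each block's vocabulary as (added, removed) relative to the previous block."""
    deltas, prev = [], set()
    for model in models:
        cur = set(model)
        deltas.append((sorted(cur - prev), sorted(prev - cur)))
        prev = cur
    return deltas


def rebuild_vocabulary(deltas, index):
    """Replay the deltas up to and including block `index` to recover its vocabulary."""
    vocab = set()
    for added, removed in deltas[:index + 1]:
        vocab -= set(removed)
        vocab |= set(added)
    return vocab


if __name__ == "__main__":
    words = "the cat sat on the mat the dog sat on the log".split()
    models = [block_model(b) for b in split_blocks(words, 6)]
    for i, (added, removed) in enumerate(vocabulary_deltas(models)):
        print(f"block {i}: +{added} -{removed}")
```

In this toy run the second block shares most of its words with the first, so its delta lists only the few words that changed. That is the redundancy the abstract says the vocabulary file avoids. A real dense-code compressor would also need to record how word ranks, and therefore codewords, change between blocks, which this sketch leaves out.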
