Japanese text compression using word-based coding
T. Morihara, N. Satoh, H. Yahagi, S. Yoshida
Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225), 1998-03-30
DOI: 10.1109/DCC.1998.672306
Citations: 2
Abstract
Summary form only given. Because Japanese characters are encoded in 16 bits, their large alphabet has made compression with 8-bit character-sampling coding methods difficult. At DCC'97, Satoh et al. (1997) reported that 16-bit character-sampling adaptive arithmetic coding is effective in improving the compression ratio. However, the adaptive compression method does not work well on the small documents produced in offices by groupware and e-mail. The present paper studies a word-based semi-adaptive compression method for Japanese text, aimed at good compression performance across a range of document sizes. The algorithm is composed of two stages. The first stage converts input strings into word-index numbers (intermediate data) corresponding to the longest matching strings in the dictionary. The second stage reduces the redundancy of the intermediate data. We adopted a 16-bit word index, and a first-order-context 16-bit sampling PPMC2 (16-bit PPM) for entropy coding in the second stage.
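The first stage described above can be illustrated with a minimal sketch: greedy longest-match conversion of an input string into word-index numbers against a static dictionary. The dictionary contents, the greedy matching strategy, and the function name here are assumptions for illustration only; the abstract does not specify how the paper's dictionary is built or searched.

```python
# Hedged sketch of a longest-match word-index encoder (stage one).
# The dictionary below is a toy example, not the paper's actual word list.

def longest_match_encode(text, dictionary):
    """Convert `text` into a list of word indices, always taking the
    longest dictionary entry that matches at the current position."""
    # Map each dictionary word to its index number.
    index_of = {w: i for i, w in enumerate(dictionary)}
    max_len = max(len(w) for w in dictionary)
    out = []
    pos = 0
    while pos < len(text):
        # Try candidate substrings from longest to shortest.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in index_of:
                out.append(index_of[candidate])
                pos += length
                break
        else:
            raise ValueError(f"no dictionary entry matches at position {pos}")
    return out

# Toy example: "abc" (index 2) matches before the shorter "ab" or "a".
dictionary = ["a", "ab", "abc", "b", "c"]
print(longest_match_encode("abcab", dictionary))  # -> [2, 1]
```

The resulting index stream is the intermediate data that the second stage would then entropy-code; with a 16-bit word index as in the paper, each output number would occupy two bytes before that final coding step.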