Study of Japanese text compression

Proceedings DCC '97. Data Compression Conference. Pub Date: 1997-03-25. DOI: 10.1109/DCC.1997.582134

N. Satoh, T. Morihara, Y. Okada, S. Yoshida

Citations: 2

Abstract

Summary form only given. Japanese uses several thousand distinct characters, each represented by a 16-bit character code, so the statistical dependencies in Japanese documents run between 16-bit units. Conventional text compression samples its input in 8-bit units because the data being compressed is usually English text. We investigated compression schemes based on 16-bit sampling, expecting them to improve compression performance. In Japanese text, where words are short, statistical schemes with a PPM (prediction by partial matching) model provide better compression ratios than sliding-dictionary schemes, so we applied 16-bit sampling to statistical schemes with a PPM model. We show that the 16-bit sampling scheme provides good compression ratios for short documents of less than several tens of kilobytes, such as office reports. The processing speed is also better.
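To make the difference between 8-bit and 16-bit sampling concrete, here is a minimal sketch that estimates the ideal adaptive code length of the same text under both symbol widths. It is not the authors' coder: the names `to_symbols` and `ppm_cost_bits` are hypothetical, the model is only order-1 with PPM-style escapes (no exclusion, no arithmetic coder), and UTF-16-BE is used as an assumed stand-in for a fixed 16-bit Japanese character code.

```python
import math
from collections import defaultdict

def to_symbols(data: bytes, width: int):
    """Cut data into fixed-width units: width=1 gives 8-bit symbols,
    width=2 gives 16-bit symbols (hypothetical helper)."""
    return [data[i:i + width] for i in range(0, len(data), width)]

def ppm_cost_bits(symbols, alphabet_size: int) -> float:
    """Ideal adaptive code length in bits under an order-1 model that
    escapes to order-0 and then to a uniform distribution.
    Escape probability is distinct / (total + distinct); this is a
    simplification, not a full PPM implementation."""
    order1 = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    order0 = defaultdict(int)                       # symbol -> count
    bits = 0.0
    prev = None
    for s in symbols:
        coded = False
        tables = ([order1[prev]] if prev is not None else []) + [order0]
        for table in tables:
            total = sum(table.values())
            distinct = len(table)
            if total == 0:
                continue  # nothing seen in this context yet: fall through for free
            if s in table:
                bits -= math.log2(table[s] / (total + distinct))
                coded = True
                break
            bits -= math.log2(distinct / (total + distinct))  # cost of the escape
        if not coded:
            bits += math.log2(alphabet_size)  # novel symbol, coded uniformly
        if prev is not None:
            order1[prev][s] += 1
        order0[s] += 1
        prev = s
    return bits

# Assumed sample text; any short Japanese document could be substituted.
text = "日本語の文書を圧縮する。日本語の文字は十六ビットで表される。" * 4
data = text.encode("utf-16-be")  # two bytes per character for these characters

cost8 = ppm_cost_bits(to_symbols(data, 1), 256)
cost16 = ppm_cost_bits(to_symbols(data, 2), 65536)
print(f"8-bit sampling:  {cost8 / 8:.1f} bytes (estimated)")
print(f"16-bit sampling: {cost16 / 8:.1f} bytes (estimated)")
```

With 8-bit sampling the model has to relearn that the two halves of each character belong together, while 16-bit sampling treats each character as one symbol at the price of a much larger alphabet for novel symbols. The estimates depend strongly on the sample text and the escape rule, so they illustrate the trade-off rather than reproduce the reported results.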



