Japanese text compression using word-based coding
T. Morihara, N. Satoh, H. Yahagi, S. Yoshida
Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225), March 30, 1998
DOI: 10.1109/DCC.1998.672306
Summary form only given. Because Japanese characters are encoded in 16 bits, their large character set has made compression with 8-bit character-sampling coding methods difficult. At DCC'97, Satoh et al. (1997) reported that 16-bit character-sampling adaptive arithmetic coding is effective in improving the compression ratio. However, that adaptive compression method does not work well on the small documents produced in offices by groupware and e-mail. The present paper studies a word-based semi-adaptive compression method for Japanese text, with the aim of achieving good compression performance across a range of document sizes. The algorithm is composed of two stages. The first stage converts input strings into word-index numbers (intermediate data) corresponding to the longest matching strings in the dictionary. The second stage reduces the redundancy of the intermediate data. We adopted a 16-bit word index, with first-order-context 16-bit-sampling PPMC2 (16-bit PPM) for entropy coding in the second stage.
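The first stage described above (greedy longest-match conversion of input text to word-index numbers) can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: the function names, the literal fallback for out-of-dictionary characters, and the index assignment are all hypothetical, and a real codec would need an escape mechanism to distinguish literals from 16-bit indices.

```python
def build_dictionary(words):
    """Map each dictionary word to an integer index (hypothetical:
    the paper uses 16-bit word indices)."""
    return {w: i for i, w in enumerate(words)}

def encode_longest_match(text, dictionary):
    """Greedily replace the longest matching dictionary string at each
    position with its word index (the paper's first-stage intermediate
    data). Unmatched characters pass through as literals here, an
    illustrative simplification."""
    max_len = max(map(len, dictionary))
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionary:
                out.append(dictionary[piece])
                i += length
                break
        else:
            out.append(text[i])  # literal fallback for unknown characters
            i += 1
    return out
```

The resulting index stream would then be fed to the second stage's entropy coder (the PPMC2 variant mentioned above), which exploits the remaining first-order context redundancy between word indices.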