Japanese document recognition based on interpolated n-gram model of character

Proceedings of 3rd International Conference on Document Analysis and Recognition. Pub Date: 1995-08-14. DOI: 10.1109/ICDAR.1995.598993

Hiroki Mori, Hirotomo Aso, S. Makino


Citations: 1

Abstract



N-gram models are widely applied in pattern recognition systems because they represent the local features of natural languages well. This paper describes a contextual postprocessing method for Japanese document recognition that uses a character trigram model, and demonstrates its advantage in practical experiments. The model is obtained automatically by statistical processing of training documents. Its ability to reduce ambiguity is evaluated by perplexity. Two smoothing methods are examined, and deleted interpolation is shown to have the superior predictive power. For leading articles, perplexity is reduced to about 22 when deleted interpolation is used. The OCR output is processed very quickly using a Viterbi algorithm. Recognition experiments on three kinds of documents show error correction rates ranging from 75 to over 90 percent.
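As a rough illustration of the ideas named in the abstract, the sketch below builds a character trigram model, mixes trigram, bigram, unigram, and uniform estimates by linear interpolation, and computes perplexity. It is not the authors' implementation. The function names, the `^` start symbol, the default `vocab_size`, and the fixed interpolation weights are all assumptions made for this example. In deleted interpolation the weights would be estimated from held-out portions of the training data rather than fixed by hand.

```python
import math
from collections import Counter


def train(text):
    """Collect character n-gram counts (n = 1, 2, 3) from a training string."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # counts of one- and two-character histories
    padded = "^^" + text  # hypothetical start symbol, assumed absent from the text
    for i in range(2, len(padded)):
        a, b, c = padded[i - 2], padded[i - 1], padded[i]
        uni[c] += 1
        bi[(b, c)] += 1
        ctx1[b] += 1
        tri[(a, b, c)] += 1
        ctx2[(a, b)] += 1
    return {"uni": uni, "bi": bi, "tri": tri,
            "ctx1": ctx1, "ctx2": ctx2, "n": len(text)}


def prob(model, a, b, c, lambdas=(0.01, 0.09, 0.3, 0.6), vocab_size=3000):
    """Interpolated estimate of P(c | a, b).

    lambdas weights the uniform, unigram, bigram, and trigram estimates and
    must sum to 1. The values here are placeholders, not tuned weights.
    """
    l0, l1, l2, l3 = lambdas
    p0 = 1.0 / vocab_size
    p1 = model["uni"][c] / model["n"] if model["n"] else p0
    # Fall back to the lower-order estimate when a history was never observed.
    p2 = model["bi"][(b, c)] / model["ctx1"][b] if model["ctx1"][b] else p1
    p3 = (model["tri"][(a, b, c)] / model["ctx2"][(a, b)]
          if model["ctx2"][(a, b)] else p2)
    return l0 * p0 + l1 * p1 + l2 * p2 + l3 * p3


def perplexity(model, text, **kwargs):
    """Per-character perplexity of a non-empty text under the model."""
    padded = "^^" + text
    log_sum = 0.0
    for i in range(2, len(padded)):
        log_sum += math.log2(prob(model, padded[i - 2], padded[i - 1],
                                  padded[i], **kwargs))
    return 2 ** (-log_sum / len(text))


if __name__ == "__main__":
    model = train("abababab")
    # Text resembling the training data should score a lower perplexity.
    print(perplexity(model, "abab", vocab_size=3))
    print(perplexity(model, "ccaa", vocab_size=3))
```

Lower perplexity means the model is less uncertain about the next character, which is the sense in which the abstract uses it to measure ambiguity reduction. The uniform term keeps unseen characters from receiving zero probability, which is a design choice of this sketch only.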

