{"title":"Japanese document recognition based on interpolated n-gram model of character","authors":"Hiroki Mori, Hirotomo Aso, S. Makino","doi":"10.1109/ICDAR.1995.598993","DOIUrl":null,"url":null,"abstract":"N-gram model is widely applied to various pattern recognition system because it well represents local features of natural languages. In this paper, we describe a contextual postprocessing method using a trigram model of character for Japanese document recognition, and its advantage is revealed by practical experiments. The model is automatically obtained by statistical processing of training documents. The ability to reduce ambiguity is evaluated by the perplexity. In the processing, two smoothing methods are examined, and the predictive power of the deleted interpolation method is shown to be superior. For leading articles, the perplexity reduced to about 22 when using deleted interpolation. The output from OCR is processed very fast using a Viterbi algorithm. Experimental results of recognition for three kinds of documents show that the error correction rates are ranged from 75 to over 90 percent.","PeriodicalId":273519,"journal":{"name":"Proceedings of 3rd International Conference on Document Analysis and Recognition","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1995-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of 3rd International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.1995.598993","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
N-gram model is widely applied to various pattern recognition system because it well represents local features of natural languages. In this paper, we describe a contextual postprocessing method using a trigram model of character for Japanese document recognition, and its advantage is revealed by practical experiments. The model is automatically obtained by statistical processing of training documents. The ability to reduce ambiguity is evaluated by the perplexity. In the processing, two smoothing methods are examined, and the predictive power of the deleted interpolation method is shown to be superior. For leading articles, the perplexity reduced to about 22 when using deleted interpolation. The output from OCR is processed very fast using a Viterbi algorithm. Experimental results of recognition for three kinds of documents show that the error correction rates are ranged from 75 to over 90 percent.