{"title":"Error Detection in Indic OCRs","authors":"V. Vinitha, C. V. Jawahar","doi":"10.1109/DAS.2016.31","DOIUrl":null,"url":null,"abstract":"A good post processing module is an indispensable part of an OCR pipeline. In this paper, we propose a novel method for error detection in Indian language OCR output. Our solution uses a recurrent neural network (RNN) for classification of a word as an error or not. We propose a generic error detection method and demonstrate its effectiveness on four popular Indian languages. We divide the words into their constituent aksharas and use their bigram and trigram level information to build a feature representation. In order to train the classifier on incorrect words, we use the mis-recognized words in the output of the OCR. In addition to RNN, we also explore the effectiveness of a generative model such as GMM for our task and demonstrate an improved performance by combining both the approaches. We tested our method on four popular Indian languages and report an average error detection performance above 80%.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAS.2016.31","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
A good post processing module is an indispensable part of an OCR pipeline. In this paper, we propose a novel method for error detection in Indian language OCR output. Our solution uses a recurrent neural network (RNN) for classification of a word as an error or not. We propose a generic error detection method and demonstrate its effectiveness on four popular Indian languages. We divide the words into their constituent aksharas and use their bigram and trigram level information to build a feature representation. In order to train the classifier on incorrect words, we use the mis-recognized words in the output of the OCR. In addition to RNN, we also explore the effectiveness of a generative model such as GMM for our task and demonstrate an improved performance by combining both the approaches. We tested our method on four popular Indian languages and report an average error detection performance above 80%.