Rohit Saluja, D. Adiga, Ganesh Ramakrishnan, P. Chaudhuri, Mark James Carman
2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), published 2017-11-01. DOI: 10.1109/ICDAR.2017.308 (https://doi.org/10.1109/ICDAR.2017.308)
A Framework for Document Specific Error Detection and Corrections in Indic OCR
In this paper, we present a framework for assisting word-level corrections in Indic OCR documents by incorporating the ability to identify, segment and combine partially correct word forms. The partially correct word forms themselves may be obtained from corrected parts of the document itself and from auxiliary sources such as dictionaries and common OCR character confusions. Our framework updates a domain dictionary and learns OCR-specific n-gram confusions from human feedback on the fly. The framework can also leverage consensus between the outputs of multiple OCR systems on the same text as an auxiliary source for dynamic dictionary building. Experimental evaluations confirm that for highly inflectional Indian languages, matching partially correct word forms can result in a significant reduction in the amount of manual input required for correction. Furthermore, significant gains are observed when the consolidated output of multiple OCR systems is employed as an auxiliary source of information. We have corrected over 1100 pages (13 books) in Sanskrit, 190 pages (1 book) in Marathi, 50 pages (part of a book) in Hindi and 1000 pages (12 books) in English using our framework. We present a book-wise analysis of the improvement in required human interaction for these languages.
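The three ideas named in the abstract (segmenting a word into known word forms, learning confusions from human corrections, and using agreement between OCR systems) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `can_segment`, `learn_confusions` and `consensus_words`, the toy dictionary, and the position-wise alignment of the two OCR outputs are all assumptions made here for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def can_segment(word, dictionary):
    """Return True if `word` splits entirely into entries of `dictionary`.

    Hypothetical stand-in for matching partially correct word forms:
    a plain word-break dynamic program over a set of known forms.
    """
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in dictionary
            for start in range(end)
        )
    return reachable[len(word)]


def learn_confusions(ocr_word, corrected_word, confusions):
    """Count substrings that a human correction replaced in the OCR output."""
    matcher = SequenceMatcher(None, ocr_word, corrected_word, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            confusions[(ocr_word[i1:i2], corrected_word[j1:j2])] += 1


def consensus_words(output_a, output_b):
    """Words on which two OCR outputs agree at the same position.

    Assumes both outputs are already aligned word by word.
    """
    return {a for a, b in zip(output_a, output_b) if a == b}


# Toy domain dictionary (assumed data), extended after a human correction.
domain_dictionary = {"rama", "sya"}
confusions = Counter()

learn_confusions("modem", "modern", confusions)
domain_dictionary.add("modern")
domain_dictionary |= consensus_words(["the", "modem", "book"],
                                     ["the", "modern", "book"])
```

In this sketch, a word that `can_segment` rejects would be flagged for review, and the learned confusion counts could then be used to rank candidate replacements. Both of those steps are left out here because the abstract does not specify them.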