A bilingual Gurmukhi-English OCR based on multiple script identifiers and language models

MOCR '13, Pub Date: 2013-08-24, DOI: 10.1145/2505377.2505381
Gurpreet Singh Lehal
Citations: 3

Abstract

English words occur frequently in Gurmukhi texts, and a monolingual Gurmukhi OCR recognizes them as garbage. The Gurmukhi OCR therefore needs bilingual capability so that it can recognize English text as well. Adding that capability, however, lowers recognition accuracy on monolingual text because of errors in script identification: even a system with 99% script identification accuracy loses about 1% recognition accuracy on monolingual text. In this paper, we present a bilingual OCR that recognizes both English and Gurmukhi scripts with no significant loss of recognition accuracy on monolingual Gurmukhi text compared with the monolingual Gurmukhi OCR. This is achieved by using multiple script identification engines and language models for both English and Gurmukhi scripts. This is the first system of its kind to recognize, with high accuracy, document images containing mixed Gurmukhi and English text, only Gurmukhi text, or only English text.
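The arithmetic behind the 99% figure can be illustrated with a short sketch. It is not the paper's method: it assumes that every word routed to the wrong script's recognizer is lost, that identifier errors are independent, and that the identifiers are combined by simple majority vote. The function names and the 0.97 monolingual accuracy are hypothetical and chosen only for illustration.

```python
from math import comb


def accuracy_after_script_id(ocr_accuracy: float, script_id_accuracy: float) -> float:
    """Accuracy on monolingual text once a script identifier sits in front of the OCR.

    Assumption: a word sent to the wrong script's recognizer is always wrong,
    so only correctly routed words can be recognized.
    """
    return ocr_accuracy * script_id_accuracy


def majority_vote_accuracy(single_accuracy: float, n_identifiers: int) -> float:
    """Probability that a majority of n identifiers picks the right script.

    Assumption: the identifiers err independently and n is odd. This is an
    illustrative combination rule, not the one described in the paper.
    """
    needed = n_identifiers // 2 + 1
    return sum(
        comb(n_identifiers, k)
        * single_accuracy ** k
        * (1 - single_accuracy) ** (n_identifiers - k)
        for k in range(needed, n_identifiers + 1)
    )


if __name__ == "__main__":
    mono = 0.97  # hypothetical monolingual Gurmukhi OCR accuracy
    one = accuracy_after_script_id(mono, 0.99)
    three = accuracy_after_script_id(mono, majority_vote_accuracy(0.99, 3))
    print(f"one identifier:    {one:.4f}")
    print(f"three identifiers: {three:.4f}")
```

Under these assumptions, a single 99% identifier costs roughly one point of accuracy on monolingual text, while three independent identifiers combined by vote bring the routing error down to about 0.03%, which is consistent in spirit with the abstract's claim of no significant reduction.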