A bilingual Gurmukhi-English OCR based on multiple script identifiers and language models

MOCR '13 | Pub Date: 2013-08-24 | DOI: 10.1145/2505377.2505381

Gurpreet Singh Lehal

Citations: 3

Abstract

English words are frequently encountered in Gurmukhi texts, and a monolingual Gurmukhi OCR recognizes such words as garbage. Adding bilingual capability to the Gurmukhi OCR is therefore necessary so that English text is recognized as well. However, bilingual capability reduces recognition accuracy on monolingual text because of errors in script identification: even a system with 99% script identification accuracy loses about 1% recognition accuracy on monolingual text. In this paper, we present a bilingual OCR that recognizes both English and Gurmukhi scripts without any significant loss of accuracy, relative to the monolingual Gurmukhi OCR, when recognizing monolingual Gurmukhi text. This is achieved by using multiple script identification engines and language models for both English and Gurmukhi scripts. This is the first system of its kind to recognize, with high accuracy, document images containing mixed Gurmukhi and English text, only Gurmukhi text, or only English text.
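The accuracy argument in the abstract can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: a word routed to the wrong script's recognizer is counted as entirely lost, the script identifiers err independently, and they are combined by simple majority vote (the abstract does not say how the engines are actually combined). The function names and the 97% recognizer accuracy are hypothetical.

```python
from math import comb


def accuracy_on_monolingual_text(script_id_accuracy, ocr_accuracy):
    """Expected word accuracy of a bilingual OCR on monolingual text.

    Assumption (not from the paper): a word sent to the wrong script's
    recognizer is always misrecognized, so only correctly routed words
    can be recognized correctly.
    """
    return script_id_accuracy * ocr_accuracy


def majority_vote_accuracy(single_accuracy, n_identifiers):
    """Accuracy of a majority vote over n independent script identifiers.

    Assumption (not from the paper): identifiers make independent errors
    and each has the same accuracy. n_identifiers must be odd so that a
    two-way vote cannot tie.
    """
    if n_identifiers < 1 or n_identifiers % 2 == 0:
        raise ValueError("n_identifiers must be a positive odd number")
    votes_needed = n_identifiers // 2 + 1
    p = single_accuracy
    # Sum the binomial probabilities of every outcome in which at least
    # a majority of the identifiers are correct.
    return sum(
        comb(n_identifiers, k) * p**k * (1 - p) ** (n_identifiers - k)
        for k in range(votes_needed, n_identifiers + 1)
    )


if __name__ == "__main__":
    mono = 0.97  # hypothetical monolingual recognizer accuracy
    single = accuracy_on_monolingual_text(0.99, mono)
    voted = accuracy_on_monolingual_text(majority_vote_accuracy(0.99, 3), mono)
    print(f"one identifier:    {single:.4f}")
    print(f"three, voted:      {voted:.4f}")
```

Under these assumptions, a single 99% identifier scales the recognizer's accuracy by 0.99, which is the roughly 1% loss the abstract describes, while a vote among three such identifiers brings the routing error well below 0.1%. Real identifiers are unlikely to be fully independent, so the actual gain would be smaller.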
