Towards a Robust OCR System for Indic Scripts

2014 11th IAPR International Workshop on Document Analysis Systems. Pub Date: 2014-04-07. DOI: 10.1109/DAS.2014.74

Praveen Krishnan, Naveen Sankaran, A. Singh, C. V. Jawahar


Citations: 28

Abstract

Current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough to recognize arbitrary collections of printed documents. Reasons for this limitation include the lack of resources (e.g., too few examples with natural variations, and little documentation of the possible font/style variations) and an architecture that requires hard segmentation of word images followed by isolated symbol recognition. Variations among scripts, latent symbol-to-Unicode conversion rules, non-standard fonts/styles, and heavy degradations are some of the major reasons why robust solutions are unavailable. In this paper, we propose a web-based OCR system which (i) follows a unified architecture for seven Indian languages, (ii) is robust against common degradations, (iii) follows a segmentation-free approach, (iv) addresses the Unicode re-ordering issues, and (v) enables continuous learning from user inputs and feedback. Our system is designed to support continuous learning while remaining usable, i.e., we capture user inputs (say, example images) to further improve the OCRs. We use the popular BLSTM-based transcription scheme to achieve this. It also enables incremental training and refinement in a seamless manner. We report superior accuracy rates in comparison with the available OCRs for the seven Indian languages.
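To make the Unicode re-ordering issue concrete: in Devanagari, the vowel sign I (U+093F) is drawn to the left of its consonant cluster but is stored after that cluster in Unicode logical order. The abstract does not say how the proposed system handles this, so the following is only a rule-based illustration of the problem, not the authors' method. The function name `visual_to_logical`, the restriction to a single pre-base vowel sign, and the simplified cluster rule (consonant, then any number of virama + consonant pairs) are all assumptions made for this sketch.

```python
# Illustrative sketch only: reorder a Devanagari string from left-to-right
# visual order into Unicode logical order. Not the method used in the paper.

VIRAMA = "\u094D"             # DEVANAGARI SIGN VIRAMA (halant)
PRE_BASE_MATRAS = {"\u093F"}  # DEVANAGARI VOWEL SIGN I, drawn left of its cluster


def is_consonant(ch: str) -> bool:
    # Core Devanagari consonants KA..HA (U+0915..U+0939).
    return "\u0915" <= ch <= "\u0939"


def visual_to_logical(symbols: str) -> str:
    """Move each pre-base vowel sign to after the consonant cluster it precedes."""
    out = []
    i, n = 0, len(symbols)
    while i < n:
        ch = symbols[i]
        if ch in PRE_BASE_MATRAS and i + 1 < n and is_consonant(symbols[i + 1]):
            # Assumed cluster shape: C (VIRAMA C)*
            j = i + 2
            while j + 1 < n and symbols[j] == VIRAMA and is_consonant(symbols[j + 1]):
                j += 2
            out.append(symbols[i + 1:j])  # the consonant cluster first
            out.append(ch)                # then the vowel sign
            i = j
            continue
        out.append(ch)  # everything else keeps its position
        i += 1
    return "".join(out)


# Visual order "vowel sign I, KA" becomes logical order "KA, vowel sign I".
ki_logical = visual_to_logical("\u093F\u0915")
```

A segmentation-based pipeline that emits symbols left to right needs a rule table of this kind for every script it supports, which is one reason such conversion rules are hard to maintain across several languages.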
