Script Identification from Handwritten Document

2011 Third National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics. Pub Date: 2011-12-15. DOI: 10.1109/NCVPRIPG.2011.22

K. Roy, S. K. Das, S. Obaidullah


Citations: 28

Abstract

Every country has its own language and script, which may or may not be shared with other countries. To communicate with each other we need a common language, and English currently plays that role. As a result, most countries (other than those that use the Roman script) use bi-script documents. For a country like India, which has a total of 12 official scripts (and 22 languages), things are more complex. To build an OCR system we first need to identify the script in which a document is written, even when the document itself is not multi-script. Postal documents and pre-printed forms are good examples of such documents. Identifying the script of a document that may be written in any of these 13 scripts is therefore very challenging. In this paper we attempt to identify the scripts used to write any of 6 official languages of India. We use very simple and efficient features at the component level for this purpose. A series of classifiers is applied to fractal-based, component-based, and topological features. The overall accuracy of the proposed system is at present 89.48% on the test set without rejection.
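The abstract does not say how the fractal-based features are computed. As an illustration only, the sketch below estimates a box-counting dimension for one binarized component, which is one common way to obtain a fractal feature. The function names, the box sizes, and the choice of box counting itself are assumptions and are not taken from the paper.

```python
import math


def box_count(image, size):
    """Count size x size boxes that contain at least one foreground (1) pixel."""
    occupied = set()
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel:
                occupied.add((r // size, c // size))
    return len(occupied)


def box_counting_dimension(image, sizes=(1, 2, 4, 8)):
    """Estimate a fractal dimension as the least-squares slope of
    log N(s) against log(1/s), where N(s) is the occupied-box count.

    `sizes` is a hypothetical choice of scales, not a value from the paper.
    """
    xs, ys = [], []
    for s in sizes:
        n = box_count(image, s)
        if n > 0:
            xs.append(math.log(1.0 / s))
            ys.append(math.log(n))
    if len(xs) < 2:
        raise ValueError("need at least two scales with foreground pixels")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Two toy 16 x 16 components: a filled block and a single horizontal stroke.
filled = [[1] * 16 for _ in range(16)]
stroke = [[1] * 16 if r == 8 else [0] * 16 for r in range(16)]

print(box_counting_dimension(filled))
print(box_counting_dimension(stroke))
```

On these toy inputs the filled block gives a value close to 2 and the thin stroke a value close to 1. Real handwritten components would fall in between, and such a value could serve as one entry in a per-component feature vector.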


