Indic Script Identification from Handwritten Document Images: An Unconstrained Block-Level Approach

2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS). Pub Date: 2015-07-09. DOI: 10.1109/ReTIS.2015.7232880

S. Obaidullah, N. Das, C. Halder, K. Roy


Citations: 16

Abstract

In a multi-script country like India, identifying the script of a document image is an essential step before choosing an appropriate script-specific OCR system. The problem becomes more complex and challenging in the case of handwritten script identification (HSI). This paper proposes an automatic HSI technique for document images of six popular Indic scripts, namely Bangla, Devanagari, Malayalam, Oriya, Roman, and Urdu. A block-level approach is followed: a 34-dimensional feature vector is first constructed using transform-based (BRT, BDCT, BFFT, and BDT), textural, and statistical techniques. A greedy attribute selection (GAS) method then selects 20 attributes for the learning process. A total of 600 unconstrained document image blocks, each of size 512×512, are prepared with an equal distribution across script types. The dataset is divided in a 2:1 ratio for training and testing. Extensive experiments are carried out on six-script, tetra-script, tri-script, and bi-script combinations. The experimental results show promising and comparable performance.
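The experimental setup described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the per-script rounding of the 2:1 split (100 blocks per script do not divide evenly by 3), the toy scoring function, and the idea of enumerating every script combination are all hypothetical. The abstract does not say how GAS scores attributes or which script combinations were actually evaluated.

```python
from itertools import combinations
import random

# Figures taken from the abstract.
SCRIPTS = ["Bangla", "Devanagari", "Malayalam", "Oriya", "Roman", "Urdu"]
TOTAL_BLOCKS = 600
BLOCKS_PER_SCRIPT = TOTAL_BLOCKS // len(SCRIPTS)  # equal distribution per script


def stratified_split(scripts, per_script, train_parts=2, test_parts=1, seed=0):
    """Split block IDs script by script in a train:test ratio (2:1 by default).

    Assumption: the split is stratified and the training share is rounded
    down per script; the abstract only states the overall 2:1 ratio.
    """
    rng = random.Random(seed)
    train, test = [], []
    for script in scripts:
        ids = [(script, i) for i in range(per_script)]
        rng.shuffle(ids)
        cut = per_script * train_parts // (train_parts + test_parts)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


def script_groups(scripts, size):
    """All possible script combinations of a given size (hypothetical helper)."""
    return list(combinations(scripts, size))


def greedy_select(n_features, k, score):
    """Generic greedy forward selection: repeatedly add the feature that
    gives the best score for the enlarged subset, until k are chosen.

    `score` is a caller-supplied function; the paper's GAS criterion is
    not given in the abstract, so this stands in for it.
    """
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


train, test = stratified_split(SCRIPTS, BLOCKS_PER_SCRIPT)
group_counts = {
    name: len(script_groups(SCRIPTS, size))
    for name, size in [("bi", 2), ("tri", 3), ("tetra", 4), ("six", 6)]
}

# Toy additive score with made-up weights, only to exercise the selector.
toy_weights = [(f * 7) % 34 for f in range(34)]
chosen = greedy_select(34, 20, lambda subset: sum(toy_weights[f] for f in subset))

print(len(train), len(test))
print(group_counts)
print(len(chosen))
```

With an additive toy score, greedy selection simply picks the 20 highest-weighted attributes. A real criterion, such as the accuracy of a classifier trained on the candidate subset, would make the order of selection matter.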
