Indic script identification from handwritten document images — An unconstrained block-level approach

S. Obaidullah, N. Das, C. Halder, K. Roy
Published in: 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS)
Publication date: 2015-07-09
DOI: 10.1109/ReTIS.2015.7232880 (https://doi.org/10.1109/ReTIS.2015.7232880)
Citation count: 16

Abstract

In a multi-script country like India, identifying the script of a document image is an essential step before choosing the appropriate script-specific OCR. The problem becomes more complex and challenging for handwritten script identification (HSI). This paper proposes an automatic HSI technique for document images in six popular Indic scripts: Bangla, Devanagari, Malayalam, Oriya, Roman, and Urdu. A block-level approach is followed: a 34-dimensional feature vector is first constructed using transform-based (BRT, BDCT, BFFT, and BDT), textural, and statistical techniques, and a greedy attribute selection (GAS) method then selects 20 attributes for the learning process. A total of 600 unconstrained document image blocks, each of size 512×512, are prepared with an equal distribution across script types. The dataset is divided in a 2:1 ratio for training and testing. Extensive experiments are carried out for six-script, tetra-script, tri-script, and bi-script combinations. The experimental results show promising and comparable performance.
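
The selection and split steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how GAS scores attributes or which classifier is used, so everything below is an assumption made for illustration: a plain greedy forward selection, a nearest-centroid classifier as the scoring function, synthetic 34-dimensional data in place of the real block features, and the hypothetical function names `make_synthetic_blocks`, `stratified_split`, `centroid_accuracy`, and `greedy_select`. The demo keeps 5 attributes rather than 20 only to stay fast in pure Python.

```python
import random


def make_synthetic_blocks(n_classes=6, per_class=100, n_feats=34, seed=0):
    """Synthetic stand-in for block features: features 0-2 carry class
    information, the remaining ones are pure noise (an assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        for _ in range(per_class):
            row = [rng.gauss(c, 0.3) if f < 3 else rng.gauss(0.0, 1.0)
                   for f in range(n_feats)]
            X.append(row)
            y.append(c)
    return X, y


def stratified_split(indices, y, train_ratio=2 / 3, seed=0):
    """Split the given sample indices class by class, so every class keeps
    roughly the same train:test proportion (2:1 by default)."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(y[i], []).append(i)
    first, second = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_ratio)
        first.extend(idxs[:cut])
        second.extend(idxs[cut:])
    return first, second


def centroid_accuracy(X, y, fit, held_out, feats):
    """Accuracy of a nearest-centroid classifier that only sees `feats`."""
    sums, counts = {}, {}
    for i in fit:
        acc = sums.setdefault(y[i], [0.0] * len(feats))
        for k, f in enumerate(feats):
            acc[k] += X[i][f]
        counts[y[i]] = counts.get(y[i], 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def sq_dist(i, c):
        return sum((X[i][f] - centroids[c][k]) ** 2
                   for k, f in enumerate(feats))

    correct = sum(min(centroids, key=lambda c: sq_dist(i, c)) == y[i]
                  for i in held_out)
    return correct / len(held_out)


def greedy_select(X, y, fit, val, n_select):
    """Greedy forward selection: repeatedly add the single attribute that
    gives the best validation accuracy together with those already chosen."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < n_select:
        best = max(remaining,
                   key=lambda f: centroid_accuracy(X, y, fit, val,
                                                   selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    X, y = make_synthetic_blocks()
    train, test = stratified_split(range(len(y)), y)   # 2:1 split
    fit, val = stratified_split(train, y)              # inner split for selection
    chosen = greedy_select(X, y, fit, val, n_select=5)
    print("selected attributes:", chosen)
    print("test accuracy:", centroid_accuracy(X, y, train, test, chosen))
```

Attributes are scored on a validation subset carved out of the training portion, so the held-out test blocks play no part in the selection. With 100 blocks per script, a per-class 2:1 split cannot be exact; this sketch rounds to 67 training and 33 test blocks per class.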