Recognition of Nastalique Urdu ligatures

MOCR '13 | Pub Date: 2013-08-24 | DOI: 10.1145/2505377.2505379
Gurpreet Singh Lehal, Ankur Rana
Citations: 28

Abstract

There has been considerable work on Arabic OCR, but all of it is based on the Naskh style. Urdu script is based on the Arabic alphabet but uses the Nastalique style. Nastalique makes OCR in general, and character segmentation in particular, highly challenging, so most researchers skip the character segmentation phase and opt for a higher unit of recognition. For Urdu, the next higher unit considered by researchers is the ligature, which lies between character and word. A ligature is a connected component of one or more characters, and an Urdu word is usually composed of 1 to 8 ligatures. There are more than 25,000 Urdu ligatures, of which the top 4567 account for 99% of coverage. From an OCR point of view, a ligature can be further segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary components correspond to the dots, diacritic marks and special symbols associated with it. To reduce the class count, ligatures with similar primary components are grouped together. In this paper, we present a system that recognizes 9262 ligatures formed from 2190 primary and 17 secondary components. Various combinations of DCT, Gabor filter and zoning-based features with kNN, HMM and SVM classifiers have been tried, and a recognition accuracy of 98% is reported on pre-segmented ligatures.
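The decomposition of a ligature into one primary and zero or more secondary connected components, followed by zoning-based features, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy bitmap `LIGATURE`, the function names, the use of 8-connectivity, the rule that the largest component is the primary one, and the 2x2 zoning grid are all assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy bitmap (1 = ink): a base stroke with one dot above it.
LIGATURE = [
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]

def connected_components(img):
    """Return the 8-connected foreground components as lists of (row, col)."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def split_primary_secondary(img):
    """Assumption: the largest component is primary, all others secondary."""
    comps = sorted(connected_components(img), key=len, reverse=True)
    if not comps:
        return [], []
    return comps[0], comps[1:]

def zoning_features(pixels, rows, cols, grid=2):
    """Fraction of the component's pixels that fall in each grid x grid zone."""
    counts = [0] * (grid * grid)
    for y, x in pixels:
        zone_row = min(y * grid // rows, grid - 1)
        zone_col = min(x * grid // cols, grid - 1)
        counts[zone_row * grid + zone_col] += 1
    return [n / len(pixels) for n in counts]

primary, secondary = split_primary_secondary(LIGATURE)
features = zoning_features(primary, len(LIGATURE), len(LIGATURE[0]))
print(len(primary), len(secondary), features)
```

Under these assumptions, a classifier would be trained on the feature vector of the primary component only, and the secondary components (dots and diacritics) would be recognized separately and combined with the primary class to name the full ligature. That is the sense in which grouping ligatures by primary shape reduces the class count.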