Ligature Analysis-based Urdu OCR Framework

Zaheer Ahmed, Khalid Iqbal, I. Mehmood, M. Ayub
Published in: 2017 International Conference on Frontiers of Information Technology (FIT), 2017-12-01
DOI: 10.1109/FIT.2017.00023 (https://doi.org/10.1109/FIT.2017.00023)
Citations: 4

Abstract

Urdu is written in an Arabic-derived script that is cursive in nature. Text runs right to left, each word is formed from top-right to bottom-left, and diacritics are placed in complex positions. Characters join to form ligatures, and ligatures combine to form words. This paper proposes a Nastaleeq Urdu OCR framework consisting of three steps: normalization and segmentation, feature extraction and classification, and text formation. In Urdu script, the last character of a ligature, like a character in isolated form, always appears in its full shape. Each ligature is therefore classified according to its segmented last character, by finding its similarity correlation with the corresponding one-, two- and three-character ligature image banks in sequence. The ligature image bank, developed during this research, comprises 3500 images and is used to classify ligatures according to the sequence in which their characters appear. The proposed framework gives promising results for Urdu Nastaleeq text recognition, with accuracies of 97.4% for isolated characters, 82.3% for two-character ligatures and 80.6% for three-character ligatures.
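
The correlation-based lookup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the similarity measure, the image size, or the decision rule, so the sketch assumes a Pearson correlation coefficient over equally sized grayscale images, an acceptance threshold of 0.9, and a reading of "in sequence" as "search the one-character bank first, then the two-character bank, then the three-character bank". The names `correlation`, `classify_ligature`, and the tiny 3x3 templates are hypothetical and exist only for illustration.

```python
import numpy as np


def correlation(a, b):
    """Pearson correlation coefficient between two equally sized images."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant (e.g. blank) image has zero variance; treat it as uncorrelated.
    return float(a @ b / denom) if denom > 0 else 0.0


def classify_ligature(image, banks, threshold=0.9):
    """Search the banks in order and return (label, score).

    `banks` is a list of dicts mapping a label to a template image.
    The first bank whose best match reaches `threshold` wins (assumed rule).
    If no bank qualifies, return (None, best score seen).
    """
    best_seen = -1.0
    for bank in banks:
        best_label, best_score = None, -1.0
        for label, template in bank.items():
            score = correlation(image, template)
            if score > best_score:
                best_label, best_score = label, score
        best_seen = max(best_seen, best_score)
        if best_label is not None and best_score >= threshold:
            return best_label, best_score
    return None, best_seen


# Toy 3x3 "image banks" (hypothetical stand-ins for the 3500-image bank).
one_char = {
    "alif": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "bay": [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
}
two_char = {"bay+alif": [[1, 0, 0], [1, 0, 0], [1, 1, 1]]}
three_char = {}

query = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
label, score = classify_ligature(query, [one_char, two_char, three_char])
print(label, round(score, 3))
```

In this toy run the query correlates poorly with both one-character templates, so the search falls through to the two-character bank. A real system would first normalize each segmented ligature to the template size, which corresponds to the normalization and segmentation step named in the abstract.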