Ligature Analysis-based Urdu OCR Framework
Zaheer Ahmed, Khalid Iqbal, I. Mehmood, M. Ayub
2017 International Conference on Frontiers of Information Technology (FIT), 2017-12-01
DOI: 10.1109/FIT.2017.00023 (https://doi.org/10.1109/FIT.2017.00023)
Citations: 4
Abstract
Urdu script belongs to the Arabic script family, which is cursive in nature: text runs right to left, each word is formed from top-right to bottom-left, and diacritics follow complex placement rules. Characters are joined to form ligatures, and combinations of ligatures form words. This paper proposes a Nastaleeq Urdu OCR framework consisting of three steps: normalization and segmentation, feature extraction and classification, and text formation. In Urdu script, the last character of any ligature, like a character in isolated form, always appears in its full shape. Each ligature is therefore classified from its segmented last character by computing a similarity correlation against the corresponding one-, two-, and three-character ligature image banks in sequence. The ligature image bank, comprising 3500 images, was developed during this research and is used to classify ligatures according to the sequence in which their characters appear. The proposed framework gives promising results for Urdu Nastaleeq text recognition, with accuracies of 97.4% for isolated characters, 82.3% for two-character ligatures, and 80.6% for three-character ligatures.
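The correlation-based matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact similarity measure, so this example assumes a plain Pearson (zero-mean normalized) correlation between equally sized images. The function names, the toy 8x8 templates, and the labels `alef`, `be`, and `dot` are hypothetical and stand in for the paper's 3500-image bank.

```python
import numpy as np

def normalized_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two equally sized grayscale images."""
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant image has zero variance; treat it as uncorrelated.
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(image: np.ndarray, bank: dict) -> tuple:
    """Return (label, score) of the bank template most correlated with image."""
    best_label, best_score = None, -1.0
    for label, template in bank.items():
        score = normalized_correlation(image, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Toy image bank (assumption): three 8x8 binary templates.
alef = np.zeros((8, 8)); alef[:, 4] = 1      # vertical stroke
be = np.zeros((8, 8)); be[6, :] = 1          # horizontal stroke
dot = np.zeros((8, 8)); dot[3:5, 3:5] = 1    # small blob
bank = {"alef": alef, "be": be, "dot": dot}

# Query: the vertical stroke with one extra noise pixel.
query = alef.copy()
query[0, 0] = 1

label, score = classify(query, bank)
print(label, round(score, 3))
```

In a fuller pipeline, one such bank per ligature length (one, two, and three characters) could be searched in sequence, with the segmented last character narrowing the candidates. That ordering follows the abstract, but the details here are an assumption.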