Ligature Analysis-based Urdu OCR Framework

2017 International Conference on Frontiers of Information Technology (FIT) | Pub Date: 2017-12-01 | DOI: 10.1109/FIT.2017.00023

Zaheer Ahmed, Khalid Iqbal, I. Mehmood, M. Ayub


Citations: 4

Abstract

Urdu is written in a script derived from Arabic. It is cursive, runs right to left, forms each word from top-right to bottom-left, and places diacritics in complex ways. Characters join to form ligatures, and combinations of ligatures form words. This paper proposes a Nastaleeq Urdu OCR framework consisting of three steps: normalization and segmentation, feature extraction and classification, and text formation. In Urdu script, the last character of any ligature, like a character in isolated form, always appears in its full shape. Each ligature is classified according to its segmented last character by computing a similarity correlation against the corresponding one-, two-, and three-character ligature image banks in sequence. The ligature image bank, developed during this research, comprises 3500 images and is used to classify ligatures according to the sequence in which characters appear. The proposed framework gives promising results for Urdu Nastaleeq text recognition, with an accuracy of 97.4% for isolated characters, 82.3% for two-character ligatures, and 80.6% for three-character ligatures.
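The abstract does not give implementation details, so the following is only a minimal sketch of the sequential bank lookup it describes. It assumes that "similarity correlation" can be approximated by a Pearson correlation between equally sized, already normalized grayscale images. The function names `correlation` and `classify`, the `threshold` parameter, and the dictionary layout of the banks are all hypothetical, and the last-character segmentation step is omitted.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equally sized images (nested lists of pixels)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("images must be normalized to the same size")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    # A flat image has zero variance; treat it as uncorrelated.
    return sum(p * q for p, q in zip(dx, dy)) / denom if denom else 0.0


def classify(ligature, banks, threshold=0.8):
    """Search the banks in order (assumed: one-, two-, then three-character).

    banks: list of dicts mapping a label to a template image.
    Returns (label, score) for the best match in the first bank whose best
    score reaches the threshold, or (None, best score seen) if none does.
    The threshold value is an illustrative assumption, not from the paper.
    """
    best_seen = -1.0
    for bank in banks:
        bank_label, bank_score = None, -1.0
        for label, template in bank.items():
            score = correlation(ligature, template)
            if score > bank_score:
                bank_label, bank_score = label, score
        best_seen = max(best_seen, bank_score)
        if bank_label is not None and bank_score >= threshold:
            return bank_label, bank_score
    return None, best_seen
```

Calling `classify(image, [one_char_bank, two_char_bank, three_char_bank])` would try the smallest ligatures first and fall through to the larger banks only when no template correlates strongly enough. That ordering is one reading of "in sequence", not a confirmed detail of the authors' method.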

