Space Anomalies in OCRs for Arabic Like Scripts

2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR) Pub Date : 2018-03-12 DOI:10.1109/ASAR.2018.8480229

Riaz Ahmad, Muhammad Zeshan Afzal, Sheikh Faisal Rashid, M. Liwicki, A. Dengel


Citations: 3

Abstract

This paper investigates and analyses the nature of errors that occur in Optical Character Recognition (OCR) for Arabic-like scripts. Existing research on OCR for Arabic-like scripts often focuses on achieving the best performance in terms of character error rate. Little effort has been directed at analysing the nature of the errors (anomalies) that occur. One important anomaly is the space anomaly, which arises from breaker characters, an essential part of Arabic-like scripts. The spaces introduced by breaker characters are not represented in the ground truth, which makes it hard for an OCR model to generalize. The model learns either to suppress genuine spaces or to generate extra spaces where they do not belong. Because of this confusion, the rendered output looks sub-optimal. This paper analyses and removes space anomalies. We present a joint approach that not only performs OCR but also handles space anomalies robustly, and thereby significantly outperforms the state of the art. Although the contribution is demonstrated through an improved character recognition rate, its impact is greater in terms of the correctness of OCR output for practical purposes, especially rendering. The claim is supported by empirical evaluation, which shows that the proposed approach achieves the best results.
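The space anomaly described in the abstract can be illustrated with a small sketch. It is not the authors' method. It only shows where a breaker character leaves a visual gap inside a word that the ground-truth text does not encode as a space. The `BREAKERS` set and the function name `intra_word_gaps` are assumptions made for illustration; the set is a small subset of Arabic letters that do not join to the following letter, not a list taken from the paper.

```python
# Illustrative subset of "breaker" letters (letters that do not connect to the
# letter that follows them). This set is an assumption, not the paper's list:
# alef, dal, thal, reh, zain, waw.
BREAKERS = frozenset("\u0627\u062f\u0630\u0631\u0632\u0648")


def intra_word_gaps(text: str) -> list[int]:
    """Return indices of breaker letters that are followed by another letter.

    At each such index the rendered word shows a visual gap, yet the
    ground-truth string contains no space character there.
    """
    gaps = []
    for i in range(len(text) - 1):
        if text[i] in BREAKERS and text[i + 1].isalpha():
            gaps.append(i)
    return gaps


def real_spaces(text: str) -> list[int]:
    """Return indices of the space characters actually present in the text."""
    return [i for i, ch in enumerate(text) if ch == " "]


if __name__ == "__main__":
    # waw + reh + dal: a single word made only of breaker letters.
    word = "\u0648\u0631\u062f"
    print(intra_word_gaps(word), real_spaces(word))
```

For the three-letter word in the example, the first two letters each leave a gap before the next letter while the string itself contains no space, which is the mismatch between image and ground truth that the abstract attributes to breaker characters.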

