Challenges and Preprocessing Recommendations for MADCAT Dataset of Handwritten Arabic Documents

2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI); Pub Date: 2018-10-01; DOI: 10.1109/CISP-BMEI.2018.8633103

Gheith A. Abandah, Ahmad S. Al-Hourani

Citations: 1

Abstract

In this paper, we analyze the Multilingual Automatic Document Classification Analysis and Translation (MADCAT) dataset, which is often used to train and test Arabic handwritten document recognition systems. We report the main challenges in MADCAT that the preprocessing stage of any recognition algorithm faces and that affect the performance of systems trained and tested on it. MADCAT is a representative dataset of Arabic handwritten documents, so investigating its challenges helps identify the requirements of the preprocessing stage. After presenting these challenges, we review the literature and recommend preprocessing algorithms suitable for preparing this dataset for handwritten Arabic word recognition systems such as JU-OCR2.



