MADCAT阿拉伯语手写文档数据集的挑战与预处理建议

2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) Pub Date : 2018-10-01 DOI:10.1109/CISP-BMEI.2018.8633103

Gheith A. Abandah, Ahmad S. Ai-Hourani

{"title":"MADCAT阿拉伯语手写文档数据集的挑战与预处理建议","authors":"Gheith A. Abandah, Ahmad S. Ai-Hourani","doi":"10.1109/CISP-BMEI.2018.8633103","DOIUrl":null,"url":null,"abstract":"In this paper, we analyze the dataset often used in training and testing Arabic handwritten document recognition systems, the Multilingual Automatic Document Classification Analysis and Translation dataset (MADCAT). We report here the main challenges present in MADCAT that the preprocessing stage of any recognition algorithm faces and affect the performance of the systems that use it for training and testing. MADCAT is a representative dataset of Arabic handwritten documents and investigating its challenges helps to identify the requirements of the preprocessing stage. After presenting these challenges, we review the literature and recommend preprocessing algorithms suitable to preprocess this dataset for handwritten Arabic word recognition systems such as JU-OCR2.","PeriodicalId":117227,"journal":{"name":"2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Challenges and Preprocessing Recommendations for MADCAT Dataset of Handwritten Arabic Documents\",\"authors\":\"Gheith A. Abandah, Ahmad S. Ai-Hourani\",\"doi\":\"10.1109/CISP-BMEI.2018.8633103\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we analyze the dataset often used in training and testing Arabic handwritten document recognition systems, the Multilingual Automatic Document Classification Analysis and Translation dataset (MADCAT). We report here the main challenges present in MADCAT that the preprocessing stage of any recognition algorithm faces and affect the performance of the systems that use it for training and testing. MADCAT is a representative dataset of Arabic handwritten documents and investigating its challenges helps to identify the requirements of the preprocessing stage. After presenting these challenges, we review the literature and recommend preprocessing algorithms suitable to preprocess this dataset for handwritten Arabic word recognition systems such as JU-OCR2.\",\"PeriodicalId\":117227,\"journal\":{\"name\":\"2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CISP-BMEI.2018.8633103\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISP-BMEI.2018.8633103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

在本文中，我们分析了经常用于训练和测试阿拉伯手写文档识别系统的数据集，即多语言自动文档分类分析和翻译数据集(MADCAT)。我们在这里报告了MADCAT中存在的主要挑战，这些挑战是任何识别算法的预处理阶段所面临的，并会影响使用它进行训练和测试的系统的性能。MADCAT是阿拉伯语手写文档的代表性数据集，调查其挑战有助于确定预处理阶段的需求。在提出这些挑战之后，我们回顾了文献，并推荐了适用于手写阿拉伯语单词识别系统(如JU-OCR2)预处理该数据集的预处理算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Challenges and Preprocessing Recommendations for MADCAT Dataset of Handwritten Arabic Documents

In this paper, we analyze the dataset often used in training and testing Arabic handwritten document recognition systems, the Multilingual Automatic Document Classification Analysis and Translation dataset (MADCAT). We report here the main challenges present in MADCAT that the preprocessing stage of any recognition algorithm faces and affect the performance of the systems that use it for training and testing. MADCAT is a representative dataset of Arabic handwritten documents and investigating its challenges helps to identify the requirements of the preprocessing stage. After presenting these challenges, we review the literature and recommend preprocessing algorithms suitable to preprocess this dataset for handwritten Arabic word recognition systems such as JU-OCR2.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)

自引率

0.00%

发文量