ASAR 2018 Layout Analysis Challenge: Using Random Forests to Analyze Scanned Arabic Books

2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR). Pub Date: 2018-03-12. DOI: 10.1109/ASAR.2018.8480330

Rana S. M. Saad, Randa I. Elanwar, N. A. Kader, S. Mashali, Margrit Betke


Citations: 3

Abstract

Physical Layout Analysis (PLA) is a necessary step in recognizing the contents of a digital document. PLA includes segmenting the document image and identifying the content type of each segment. PLA for digitized Arabic documents is challenging due to the nature of the Arabic script. In this paper, we introduce a PLA system for Arabic documents that were digitized by scanning. Our system, RFAAD (short for "Random Forests for Analyzing Arabic Documents"), starts with morphological preprocessing of the digitized hard copy and then extracts geometrical, shape, and context features to classify each connected component (CC) of the digital image as text or non-text. Random forests are trained using the first dataset release of a large data collection project, BCE-Arabic-v1 [22]. Our system shows strong performance on BCE data in terms of CC classification accuracy and F1-score (97.5% and 97.7%, respectively). When evaluated on datasets by other researchers [2], [11], RFAAD also performs well. Moreover, RFAAD shows moderately strong performance when applied to the most challenging layouts of the benchmarking dataset of the ASAR 2018 competition PLA-SAB. The performance of RFAAD suggests that our work, with some modifications, has the potential to solve other open problems in the document analysis area and attain a relatively high degree of generalization.
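The connected-component stage and the two reported metrics can be illustrated with a small sketch. This is not the RFAAD implementation: the feature names (`aspect_ratio`, `fill_ratio`), the 8-connectivity choice, the toy `page` bitmap, and the label lists are all assumptions made for illustration, and the morphological preprocessing and random forest classifier are omitted. It only shows how per-component geometric features might be computed and how accuracy and F1-score are derived from text/non-text predictions.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected foreground (1) pixels; return one pixel list per component."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def geometric_features(pixels):
    """Hypothetical geometric features for one component (not the paper's feature set)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "fill_ratio": len(pixels) / (width * height),
    }


def accuracy_and_f1(y_true, y_pred, positive="text"):
    """Accuracy over all components and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Toy binarized page: a 2x2 blob and a 1x3 horizontal stroke.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
components = connected_components(page)
features = [geometric_features(p) for p in components]

# Made-up labels, standing in for ground truth and classifier output.
truth = ["text", "text", "non-text", "text"]
predicted = ["text", "non-text", "non-text", "text"]
accuracy, f1 = accuracy_and_f1(truth, predicted)
```

In a fuller pipeline, the feature dictionaries would be turned into vectors and passed to a random forest classifier trained on labeled components; the metric function would then be applied to its predictions on held-out data.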

