Rana S. M. Saad, Randa I. Elanwar, N. A. Kader, S. Mashali, Margrit Betke
ASAR 2018 Layout Analysis Challenge: Using Random Forests to Analyze Scanned Arabic Books
2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), publication date 2018-03-12
DOI: 10.1109/ASAR.2018.8480330 (https://doi.org/10.1109/ASAR.2018.8480330)
Citation count: 3
Abstract
Physical Layout Analysis (PLA) is a necessary step in recognizing the contents of a digital document. PLA includes segmenting the document image and identifying the content type of each segment. PLA for digitized Arabic documents is challenging due to the nature of the Arabic script. In this paper, we introduce a PLA system for Arabic documents that were digitized by scanning. Our system, RFAAD (short for "Random Forests for Analyzing Arabic Documents"), starts with morphological preprocessing of the digitized hard copy and then extracts geometrical, shape, and context features to classify each connected component (CC) of the digital image as text or non-text. The random forests are trained on BCE-Arabic-v1 [22], the first dataset release of a large data collection project. Our system shows strong performance on BCE data in terms of CC classification accuracy and F1-score (97.5% and 97.7%, respectively). When evaluated on datasets by other researchers [2], [11], RFAAD also performs well. Moreover, RFAAD shows moderately strong performance when applied to the most challenging layouts of the benchmarking dataset of the ASAR 2018 competition PLA-SAB. The performance of RFAAD suggests that our work, with some modifications, has the potential to solve other open problems in the document analysis area and attain a relatively high degree of generalization.
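The connected-component step described in the abstract can be illustrated with a minimal Python sketch. It is not the RFAAD implementation: the abstract does not list the exact features, so the 8-connectivity choice, the function names `connected_components` and `geometric_features`, and the specific features (bounding-box size, area, aspect ratio, density) are assumptions made for illustration. In a full pipeline, feature vectors like these would be passed to a trained classifier such as a random forest to label each component as text or non-text.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components of a binary image.

    `image` is a list of rows holding 0 (background) or 1 (foreground).
    Each component is a list of (row, col) pixel coordinates.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def geometric_features(pixels):
    """Compute illustrative (assumed) geometric features for one component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "area": len(pixels),
        "aspect_ratio": width / height,
        # Fraction of the bounding box covered by foreground pixels.
        "density": len(pixels) / (width * height),
    }


# Toy binarized "page": a 2x2 blob and a separate 1x3 horizontal stroke.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
features = [geometric_features(cc) for cc in connected_components(page)]
```

On a real scan, the input would first be binarized and cleaned by morphological preprocessing, and a library routine for connected-component labeling would typically replace the pure-Python flood fill shown here.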