Printed Arabic Text Database for Automatic Recognition Systems
Hassina Bouressace, J. Csirik
Proceedings of the 2019 5th International Conference on Computer and Technology Applications, 2019-04-16
DOI: 10.1145/3323933.3324082 (https://doi.org/10.1145/3323933.3324082)
Citations: 4
Abstract
Document image analysis and recognition are important topics in artificial intelligence because they are necessary for document retrieval. The availability of a database with good script samples is therefore a key requirement for machine-learning processes. Good printed-text databases exist for Latin languages, but databases with Arabic samples are lacking. This paper presents a new comprehensive database called PATD (Printed Arabic Text Database), which contains 810 images scanned in grayscale at different resolutions, leading to 2,954 smartphone-captured images taken under varying capture conditions (blurred, at different angles, and in different light). It is based on ten newspapers with different structures and contains open-vocabulary, multi-font, multi-size, and multi-style text. The database is described in detail and is intended for the research community.