图书馆阿拉伯文化遗产光学字符识别系统QATIP

2016 12th IAPR Workshop on Document Analysis Systems (DAS) Pub Date : 2016-04-11 DOI:10.1109/DAS.2016.81

Felix Stahlberg, S. Vogel

{"title":"图书馆阿拉伯文化遗产光学字符识别系统QATIP","authors":"Felix Stahlberg, S. Vogel","doi":"10.1109/DAS.2016.81","DOIUrl":null,"url":null,"abstract":"Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries\",\"authors\":\"Felix Stahlberg, S. Vogel\",\"doi\":\"10.1109/DAS.2016.81\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).\",\"PeriodicalId\":197359,\"journal\":{\"name\":\"2016 12th IAPR Workshop on Document Analysis Systems (DAS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 12th IAPR Workshop on Document Analysis Systems (DAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DAS.2016.81\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAS.2016.81","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

如今，商用光学字符识别(OCR)软件在现代阿拉伯语文档的高质量扫描上实现了非常高的准确性。然而，图书馆中的大部分阿拉伯文化遗产收藏通常更具挑战性-例如，由打字文件，早期印刷品和历史手稿组成。在本文中，我们提出了一个面向终端用户的QATIP系统，用于这些文档的OCR。该识别基于Kaldi工具包和复杂的文本图像规范化。本文包含两个主要贡献:首先，我们描述了库的QATIP接口，该接口由用于添加和监视作业的图形用户界面和用于自动访问的web API组成。其次，我们提出了针对连续阿拉伯语OCR的语言建模和结扎建模的新方法。我们在早期印刷品和历史手稿上测试了我们的QATIP系统，并报告了实质性的改进-例如，QATIP的字符错误率为12.6%，而我们实验设置(Tesseract)中最好的OCR产品的错误率为51.8%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

自引率

0.00%

发文量