OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters

2016 12th IAPR Workshop on Document Analysis Systems (DAS) Pub Date: 2016-04-11 DOI: 10.1109/DAS.2016.51

A. Ul-Hasan, S. S. Bukhari, A. Dengel


Citations: 20

Abstract

Digitizing historical documents is crucial to preserving literary heritage. With the availability of low-cost capturing devices, libraries and institutes all over the world have preserved old literature in the form of scanned documents. However, finding information in these scans remains tedious because the images themselves cannot be searched as text. Contemporary machine learning approaches have been applied successfully to recognize both printed and handwritten text; however, these approaches require a lot of transcribed training data to reach satisfactory performance. Transcribing documents manually is laborious and costly, requiring many man-hours and language-specific expertise. This paper presents a generic iterative training framework to address this issue. The proposed framework applies not only to historical documents but also to present-day documents for which manually transcribed training data is unavailable. Starting with the minimal information available, the proposed approach iteratively corrects the training and generalization errors. Specifically, a segmentation-based OCR method is trained on individual symbols, and the semi-corrected text lines it recognizes are then used as ground-truth data for segmentation-free sequence learning, which learns to correct the errors in the ground truth by incorporating context-aware processing. The proposed approach is applied to a collection of 15th-century Latin documents. The iterative procedure using segmentation-free OCR reduced the initial character error rate of about 23% (obtained from segmentation-based OCR) to less than 7% in a few iterations.
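The abstract does not say how the character error is computed or how the training loop is implemented. The following is a minimal Python sketch under two assumptions: that the character error rate is the edit distance between recognized and reference text divided by the reference length (a common definition), and that each iteration retrains a line recognizer on the current transcripts and then re-recognizes every line. The names `levenshtein`, `character_error_rate`, `iterative_training`, `train`, and `recognize` are hypothetical and do not come from the paper.

```python
from typing import Any, Callable, List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


def iterative_training(line_images: Sequence[Any],
                       initial_transcripts: Sequence[str],
                       train: Callable[[Sequence[Any], Sequence[str]], Any],
                       recognize: Callable[[Any, Any], str],
                       iterations: int) -> List[str]:
    """Hypothetical outer loop: retrain on the current (imperfect) transcripts,
    then replace them with the new model's output for the next round."""
    transcripts = list(initial_transcripts)
    for _ in range(iterations):
        model = train(line_images, transcripts)
        transcripts = [recognize(model, image) for image in line_images]
    return transcripts
```

In this sketch, `initial_transcripts` stands for the output of the segmentation-based OCR, while `train` and `recognize` stand for the segmentation-free sequence learner; both are supplied by the caller because the abstract does not describe them. Calling `character_error_rate` on a small manually transcribed sample after each iteration would be one way to track a reduction such as the reported drop from about 23% to below 7%.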


