大型打印文档图像集合的内容级注释

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Pub Date : 2007-09-23 DOI:10.1109/ICDAR.2007.89

Anand Kumar, C. V. Jawahar

{"title":"大型打印文档图像集合的内容级注释","authors":"Anand Kumar, C. V. Jawahar","doi":"10.1109/ICDAR.2007.89","DOIUrl":null,"url":null,"abstract":"A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other document understanding tasks.","PeriodicalId":279268,"journal":{"name":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":"{\"title\":\"Content-level Annotation of Large Collection of Printed Document Images\",\"authors\":\"Anand Kumar, C. V. Jawahar\",\"doi\":\"10.1109/ICDAR.2007.89\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other document understanding tasks.\",\"PeriodicalId\":279268,\"journal\":{\"name\":\"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"43\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2007.89\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2007.89","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 43

摘要

大型标注语料库对于鲁棒光学字符识别器的开发至关重要。然而，创建带注释的语料库是一项乏味的任务。这很费力，特别是当注释是字符级别时。在本文中，我们提出了一种有效的分层方法来标注大量打印文档图像。我们将文档图像与独立键入的文本对齐。该方法是模型驱动的，目的是在字符级别对以三种不同分辨率扫描的大量文档进行注释。我们使用XML表示来存储注释信息。api提供了内容级别的访问，以便在ocr的培训和评估以及其他文档理解任务中轻松使用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Content-level Annotation of Large Collection of Printed Document Images

A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other document understanding tasks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)

自引率

0.00%

发文量