Content-level Annotation of Large Collection of Printed Document Images

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). Pub Date: 2007-09-23. DOI: 10.1109/ICDAR.2007.89

Anand Kumar, C. V. Jawahar

Citations: 43

Abstract

A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating large collections of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate, at the character level, large collections of documents scanned at three different resolutions. We employ an XML representation to store the annotation information. APIs provide content-level access for easy use in training and evaluating OCRs and in other document understanding tasks.
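To make the idea of XML-stored, content-level annotation concrete, here is a minimal sketch in Python. The abstract does not give the schema or the API, so everything below is an assumption for illustration: the element names (`page`, `line`, `word`, `char`), the `text` and `bbox` attributes, the sample values, and the helper functions `parse_bbox` and `iter_chars` are all hypothetical, not the authors' format.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation layout. Element names, attribute names, and values
# are illustrative assumptions, not the schema used in the paper.
SAMPLE = """
<page image="page_001.png" dpi="300">
  <line id="l1">
    <word id="w1" text="cat" bbox="10,20,70,48">
      <char text="c" bbox="10,20,28,48"/>
      <char text="a" bbox="30,20,48,48"/>
      <char text="t" bbox="50,20,70,48"/>
    </word>
  </line>
</page>
"""


def parse_bbox(value):
    """Turn an 'x1,y1,x2,y2' string into a tuple of four integers."""
    return tuple(int(v) for v in value.split(","))


def iter_chars(root):
    """Yield (character, bounding box) pairs in document order."""
    for word in root.iter("word"):
        for ch in word.iter("char"):
            yield ch.get("text"), parse_bbox(ch.get("bbox"))


root = ET.fromstring(SAMPLE)
chars = list(iter_chars(root))

# Rebuild the word's text from its character-level annotations.
word_text = "".join(text for text, _ in chars)
```

A nested layout like this would let the same file serve word-level and character-level queries, which is one plausible reading of "access at content level"; the real representation may differ.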
