Content-level Annotation of Large Collection of Printed Document Images

Anand Kumar, C. V. Jawahar
{"title":"Content-level Annotation of Large Collection of Printed Document Images","authors":"Anand Kumar, C. V. Jawahar","doi":"10.1109/ICDAR.2007.89","DOIUrl":null,"url":null,"abstract":"A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other document understanding tasks.","PeriodicalId":279268,"journal":{"name":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2007.89","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 43

Abstract

A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creation of annotated corpora is a tedious task. It is laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotation of large collection of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collection of documents, scanned in three different resolutions, at character level. We employ an XML representation for storage of the annotation information. APIs are provided for access at content level for easy use in training and evaluation of OCRs and other document understanding tasks.
大型打印文档图像集合的内容级注释
大型标注语料库对于鲁棒光学字符识别器的开发至关重要。然而,创建带注释的语料库是一项乏味的任务。这很费力,特别是当注释是字符级别时。在本文中,我们提出了一种有效的分层方法来标注大量打印文档图像。我们将文档图像与独立键入的文本对齐。该方法是模型驱动的,目的是在字符级别对以三种不同分辨率扫描的大量文档进行注释。我们使用XML表示来存储注释信息。api提供了内容级别的访问,以便在ocr的培训和评估以及其他文档理解任务中轻松使用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信