Handwritten and Printed Text Identification in Historical Archival Documents

Mahsa Vafaie, O. Bruns, Nastasja Pilz, J. Waitelonis, H. Sack
Published in: Archiving : final program and proceedings. IS & T's Archiving Conference
Publication date: 2022-06-07
DOI: 10.2352/issn.2168-3204.2022.19.1.4 (https://doi.org/10.2352/issn.2168-3204.2022.19.1.4)
Citations: 2

Abstract

Historical archival records are difficult for OCR systems to transcribe correctly because of their visual complexity, e.g. mixed printed text and handwritten annotations, paper degradation and faded ink. This paper addresses the automatic identification and separation of handwritten and printed text in historical archival documents. It covers the creation of an artificial dataset with pixel-level annotations and presents a new model, based on a fully convolutional network (FCN), that is trained on historical data. Initial test results on synthesised data indicate an 18% IoU improvement in the recognition of printed pixels and a 10% IoU improvement in the recognition of handwritten pixels, compared to the state-of-the-art model trained on modern documents. Furthermore, an extrinsic OCR-based evaluation on the printed layer extracted from real historical documents shows a 26% performance increase.
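The IoU (intersection over union) figures quoted in the abstract are per-class scores over pixel labels. The sketch below shows how such a score is commonly computed. It is an illustration only, not the authors' evaluation code: the label encoding (0 = background, 1 = printed, 2 = handwritten), the function name `class_iou` and the toy masks are all assumptions made for this example.

```python
from typing import Sequence

# Hypothetical label encoding, chosen for this example only.
BACKGROUND, PRINTED, HANDWRITTEN = 0, 1, 2


def class_iou(pred: Sequence[Sequence[int]],
              truth: Sequence[Sequence[int]],
              label: int) -> float:
    """Return intersection over union for one class.

    `pred` and `truth` are equally sized 2D grids of integer pixel labels.
    Returns NaN when the class is absent from both grids.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == label
            in_truth = t == label
            if in_pred and in_truth:
                intersection += 1
            if in_pred or in_truth:
                union += 1
    return intersection / union if union else float("nan")


# Toy 2x4 masks: the first row holds printed pixels, the second handwritten.
truth = [[0, 1, 1, 0],
         [0, 2, 2, 0]]
pred = [[0, 1, 0, 0],
        [0, 2, 2, 2]]

printed_iou = class_iou(pred, truth, PRINTED)
handwritten_iou = class_iou(pred, truth, HANDWRITTEN)
print(f"printed IoU: {printed_iou:.3f}")
print(f"handwritten IoU: {handwritten_iou:.3f}")
```

In the toy masks, one of the two printed pixels is missed, and one background pixel is wrongly labelled as handwritten, so both classes score below 1. The abstract does not state whether the reported improvements are absolute percentage points or relative gains, so this sketch makes no claim about that.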