A Robust Page Frame Detection Method for Complex Historical Document Images

M. Reza, Md. Ajraf Rakib, S. S. Bukhari, A. Dengel
International Conference on Pattern Recognition Applications and Methods, published 2019-02-19. DOI: 10.5220/0007382405560564 (https://doi.org/10.5220/0007382405560564)

Abstract

Document layout analysis is the most important part of converting scanned page images into searchable full text. A great deal of research addresses structured and semi-structured documents (journal articles, books, magazines, invoices), but far less addresses historical documents. Digitizing historical documents is more challenging than digitizing regular structured documents because of poor image quality, damaged characters, and large amounts of textual and non-textual noise. In the scientific community, extraneous symbols from the neighboring page are considered textual noise, while black borders, speckles, rulers, and various types of images along the border of the document are considered non-textual noise. Existing historical document analysis methods cannot handle all of this noise, which is a major reason that Optical Character Recognition (OCR) output contains undesired text that must be removed afterward with considerable extra effort. This paper presents a new approach to historical document image cleanup based on detecting the page frame of the document. The goal of the method is to find the actual content area of the document and ignore noise along the page border. We use morphological transforms, the line segment detector, and a geometric matching algorithm to find an ideal page frame for the document. We evaluate our approach on printed historical documents from the 16th to 19th centuries. OCR performance on the historical documents increased by 4.49% after applying our page frame detection method. In addition, OCR accuracy on contemporary documents increased by about 6.69%.
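
To make the idea of a page frame concrete, the following is a minimal sketch of the morphological step only, under strong assumptions. It is not the authors' pipeline: the line segment detector and the geometric matching stage are omitted, and the function names (`estimate_page_frame`, `_erode`, `_dilate`), the 3x3 structuring element, and the list-of-lists binary image format are all hypothetical choices made for illustration. The sketch applies a morphological opening (erosion followed by dilation) to suppress thin borders and speckles, then takes the bounding box of the remaining ink as a rough content area.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # binary image: 1 = ink (foreground), 0 = background


def _erode(img: Image) -> Image:
    """3x3 erosion: a pixel survives only if its whole 3x3 neighbourhood is ink.

    Pixels on the image edge are always cleared, because part of their
    neighbourhood lies outside the image and is treated as background.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def _dilate(img: Image) -> Image:
    """3x3 dilation: a pixel becomes ink if any pixel in its 3x3 neighbourhood is ink."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx]:
                        out[y][x] = 1
    return out


def estimate_page_frame(img: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), inclusive, of the ink left after opening.

    Returns None when the opening removes all ink.
    """
    opened = _dilate(_erode(img))  # opening removes structures thinner than 3 pixels
    rows = [y for y, row in enumerate(opened) if any(row)]
    cols = [x for x in range(len(opened[0])) if any(row[x] for row in opened)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]
```

On a toy image with a solid content block, a one-pixel black border, and an isolated speckle, the opening discards the border and the speckle, so the returned box covers only the block. Real text strokes are often thinner than three pixels and would also be erased by this opening, which is one reason a practical method needs more than this single step.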