Document-Zone Classification in Torn Documents

S. Chanda, K. Franke, U. Pal
Published in: 2010 12th International Conference on Frontiers in Handwriting Recognition (2010-11-16)
DOI: 10.1109/ICFHR.2010.12 (https://doi.org/10.1109/ICFHR.2010.12)
Citations: 3

Abstract

Arbitrary orientation and sparse data content are common characteristics of torn documents. To ensure accuracy and reliability in computer-based analysis, content-zone segmentation is required. In our previous work, we studied the segmentation of handwritten and printed text. A questioned document piece in the form of an office note, however, might also contain non-text data such as logos, graphics, and pictures. Hence a more precise content-zone classification is required. In this paper we propose a two-tier approach for segmenting non-text, handwritten text, and printed text. The first tier discriminates between text and non-text regions. The second tier classifies the text zones identified by the first tier as handwritten or printed. Gabor features and chain-code features are used in Tier-1 and Tier-2, respectively. Using an SVM classifier, we correctly identified 97.65% of the 31,227 text regions in our current test data. The proposed approach identified 98.69% of printed and 96.39% of handwritten text among all identified text regions.
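
The two-tier flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no feature parameters or SVM settings, so the zone layout (`texture_energy`, `contour`), the threshold rules, and the function names below are all hypothetical stand-ins. A simple threshold replaces the Gabor-feature SVM of Tier-1, and a rule on a Freeman chain-code histogram replaces the chain-code SVM of Tier-2. Only the cascade structure (text versus non-text first, then handwritten versus printed) follows the abstract.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[int, int]

# Freeman 8-direction codes for a step (dx, dy); 0 = east, counting
# counter-clockwise, with y growing upward in this sketch.
FREEMAN = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
           (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}


def chain_code_histogram(contour: Sequence[Point]) -> List[float]:
    """Normalised histogram of Freeman directions along an 8-connected contour."""
    counts = [0] * 8
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        counts[FREEMAN[(x1 - x0, y1 - y0)]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 8


def classify_zone(zone: Dict,
                  is_text: Callable[[Dict], bool],
                  is_handwritten: Callable[[Dict], bool]) -> str:
    """Tier-1 separates text from non-text; Tier-2 runs only on text zones."""
    if not is_text(zone):
        return "non-text"
    return "handwritten" if is_handwritten(zone) else "printed"


# Hypothetical stand-ins for the two trained classifiers (assumptions, not
# taken from the paper): a texture-energy threshold for Tier-1 and a
# "share of diagonal strokes" rule for Tier-2.
def toy_is_text(zone: Dict) -> bool:
    return zone["texture_energy"] > 0.5


def toy_is_handwritten(zone: Dict) -> bool:
    hist = chain_code_histogram(zone["contour"])
    diagonal_share = hist[1] + hist[3] + hist[5] + hist[7]
    return diagonal_share > 0.25


square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]   # axis-aligned strokes only
slanted = [(0, 0), (1, 1), (2, 2)]                  # diagonal strokes only

zones = [
    {"texture_energy": 0.1, "contour": square},
    {"texture_energy": 0.9, "contour": square},
    {"texture_energy": 0.9, "contour": slanted},
]
labels = [classify_zone(z, toy_is_text, toy_is_handwritten) for z in zones]
print(labels)
```

In a real system the two toy predicates would be replaced by trained classifiers operating on the respective feature vectors. The cascade itself stays the same, which is why the predicates are passed in as arguments.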