Document-Zone Classification in Torn Documents

2010 12th International Conference on Frontiers in Handwriting Recognition. Pub Date: 2010-11-16. DOI: 10.1109/ICFHR.2010.12

S. Chanda, K. Franke, U. Pal

Citations: 3

Abstract

Arbitrary orientation and sparse data content are common characteristics of torn documents. To ensure accuracy and reliability in computer-based analysis, content-zone segmentation is required. In our previous work, we studied the segmentation of handwritten and printed text. A questioned document piece in the form of an office note, however, might also contain non-text data such as logos, graphics, and pictures. Hence, a more precise content-zone classification is required. In this paper, we propose a two-tier approach for segmenting non-text, handwritten text, and printed text. The first tier discriminates between text and non-text regions. The second tier classifies the text zones identified in the first tier as handwritten or printed. Gabor features are used in Tier-1 and chain-code features in Tier-2. Using an SVM classifier, we successfully identified 97.65% of the 31,227 text regions in our current test data. Among all identified text regions, the proposed approach identified 98.69% of the printed text and 96.39% of the handwritten text.
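The two-tier flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the Gabor filter bank, the exact chain-code feature, or the SVM settings, so the two tier classifiers are passed in as hypothetical callables (`is_text`, `is_printed`), and the chain-code feature shown is the plain Freeman 8-direction histogram, used here only as a stand-in. All function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]

# Freeman 8-direction codes: 0 = +x, then steps of 45 degrees towards +y.
_DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(contour: Sequence[Point]) -> List[int]:
    """Return the Freeman chain code of a contour of 8-connected points."""
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in _DIRECTIONS:
            raise ValueError(f"{(x0, y0)} and {(x1, y1)} are not 8-connected")
        codes.append(_DIRECTIONS[step])
    return codes


def chain_code_histogram(contour: Sequence[Point]) -> List[float]:
    """Normalised 8-bin direction histogram (illustrative Tier-2 feature)."""
    codes = chain_code(contour)
    if not codes:
        return [0.0] * 8
    counts = Counter(codes)
    return [counts[d] / len(codes) for d in range(8)]


def classify_zone(
    zone: object,
    is_text: Callable[[object], bool],      # hypothetical Tier-1 classifier
    is_printed: Callable[[object], bool],   # hypothetical Tier-2 classifier
) -> str:
    """Two-tier cascade: Tier-2 runs only on zones that Tier-1 accepts as text."""
    if not is_text(zone):
        return "non-text"
    return "printed" if is_printed(zone) else "handwritten"
```

In a fuller pipeline, `is_text` would wrap a classifier trained on Gabor features and `is_printed` one trained on chain-code features. The cascade structure means that any text zone missed by the first tier never reaches the second, which is why the abstract reports the Tier-2 rates relative to the identified text regions.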


