Document-Zone Classification in Torn Documents

2010 12th International Conference on Frontiers in Handwriting Recognition. Pub Date: 2010-11-16. DOI: 10.1109/ICFHR.2010.12

S. Chanda, K. Franke, U. Pal

Citations: 3

Abstract

Arbitrary orientation and sparse data content are common characteristics of torn documents. To ensure accuracy and reliability in computer-based analysis, content-zone segmentation is required. In our previous work, we studied segmentation of handwritten and printed text. A questioned document piece in the form of an office note, however, might also contain non-text data such as logos, graphics, and pictures. Hence a more precise content-zone classification is required. In this paper we propose a two-tier approach for segmenting non-text, handwritten, and printed text zones. The first tier discriminates between text and non-text regions. The second tier classifies the text zones identified in the first tier as handwritten or printed. Gabor features and chain-code features are used in Tier-1 and Tier-2, respectively. Using an SVM classifier, we successfully identified 97.65% of 31,227 text regions in our current test data. The proposed approach identified 98.69% of printed and 96.39% of handwritten text among all identified text regions.
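The two-tier control flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the abstract does not specify how the chain-code features are built, so the sketch uses a normalized histogram of Freeman 8-direction codes, one common form of chain-code feature. The two classifiers are passed in as plain callables (`is_text`, `is_handwritten`), standing in for the trained SVMs and the Gabor feature extraction, which are not reproduced.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]

# Freeman 8-direction codes keyed by (dx, dy), with y growing downward
# as in image coordinates: 0 = east, 1 = north-east, 2 = north, ...
FREEMAN = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}


def chain_code(contour: Sequence[Point]) -> List[int]:
    """Freeman chain code of a closed contour of 8-connected pixels."""
    codes = []
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        codes.append(FREEMAN[(x1 - x0, y1 - y0)])
    return codes


def chain_code_histogram(contour: Sequence[Point]) -> List[float]:
    """Normalized 8-bin histogram of chain-code directions.

    Assumed feature design; the paper's actual chain-code features
    may differ.
    """
    codes = chain_code(contour)
    hist = [0.0] * 8
    for code in codes:
        hist[code] += 1.0
    total = len(codes)
    return [h / total for h in hist] if total else hist


def classify_zone(
    zone: object,
    is_text: Callable[[object], bool],
    is_handwritten: Callable[[object], bool],
) -> str:
    """Two-tier cascade: Tier-1 separates text from non-text,
    Tier-2 runs only on zones that Tier-1 accepted as text."""
    if not is_text(zone):
        return "non-text"
    return "handwritten" if is_handwritten(zone) else "printed"


if __name__ == "__main__":
    # A unit square traced on the pixel grid: east, south, west, north.
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(chain_code(square))
    print(chain_code_histogram(square))
    # Placeholder predicates in place of trained classifiers.
    print(classify_zone(square, lambda z: True, lambda z: False))
```

The cascade structure means a Tier-2 mistake can only occur on a zone that Tier-1 already labeled as text, which is why the abstract reports the printed and handwritten rates relative to the identified text regions.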
