A System for Handwritten and Machine-Printed Text Separation in Bangla Document Images

2012 International Conference on Frontiers in Handwriting Recognition, Pub Date: 2012-09-18, DOI: 10.1109/ICFHR.2012.171

P. Banerjee, B. Chaudhuri

Citations: 16

Abstract

In this paper, we describe an approach for distinguishing handwritten text from machine-printed text in annotated machine-printed Bangla document images. In OCR applications, separating machine-printed from handwritten characters is important so that each kind can be sent to its own recognition engine. Identifying the handwritten parts is also useful for deleting them and cleaning the document image. We present a classification system that takes each connected component in the document image and assigns it to one of two classes, "machine-printed" or "hand-written". The proposed system contains a preprocessing step that smooths the object border and finds the connected components. Bangla script-specific features are extracted from each connected component image, and a standard SVM-based classifier generates the final response. Experimental results on a data set show that the proposed approach achieves an overall accuracy of 96.49%.
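The connected-component step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the connectivity rule, the smoothing method, or the Bangla-specific features, so everything below is an assumption made for illustration: 8-connectivity, a binary image given as a list of rows, and the hypothetical helper names `connected_components` and `bounding_box`. It shows only how a page might be split into components that a classifier could then label, not the authors' implementation.

```python
from collections import deque


def connected_components(image):
    """Group foreground pixels into 8-connected components (assumed rule).

    `image` is a list of rows; nonzero pixels are foreground (ink).
    Returns a list of components, each a list of (row, col) pixels.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component, inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


# Toy binary page with two separate ink blobs (illustrative data only).
page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
components = connected_components(page)
boxes = [bounding_box(p) for p in components]
```

In a full pipeline of the kind the abstract describes, each component's cropped image would be turned into a feature vector and passed to an SVM that outputs "machine-printed" or "hand-written"; those stages are left out here because the abstract gives no detail on them.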
