An approach for printed document labeling

2014 First International Conference on Automation, Control, Energy and Systems (ACES). Pub Date: 2014-05-01. DOI: 10.1109/ACES.2014.6808032

Chandranath Adak

Citations: 0

Abstract

A document image contains text and non-text regions; it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, where the textual regions consist of printed characters and the non-text regions are mainly photographic images. We propose a model that labels the different components of a printed document image, i.e., it identifies headings, subheadings, captions, articles, and photos. Our method begins with a preprocessing stage in which fuzzy c-means clustering segments the document image into a printed (object) region and background. The Hough transform is then used to find white-line dividers in the object region, and grid structure examination is used to extract the non-text portion. After that, we use a horizontal histogram to find text lines, and then we label the different components. Our method gives promising results on printed documents in different scripts.
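To make two of the steps above concrete, the following is a minimal sketch in pure Python of a two-cluster fuzzy c-means binarization followed by a horizontal histogram (row-wise ink count) that locates text lines. It is an illustration under stated assumptions, not the paper's implementation: the abstract gives no parameters, so the fuzzifier `m = 2`, the iteration count, the 0.5 membership cutoff, and the function names (`memberships`, `fuzzy_c_means_two_level`, `binarize`, `horizontal_histogram`, `find_text_lines`) are all hypothetical choices. The Hough transform and grid structure examination steps are omitted.

```python
def memberships(value, centers, m=2.0, eps=1e-9):
    """Fuzzy membership of one intensity in each cluster (standard FCM formula)."""
    dists = [abs(value - c) + eps for c in centers]  # eps avoids division by zero
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** power for dj in dists) for dk in dists]


def fuzzy_c_means_two_level(values, m=2.0, n_iter=20):
    """Cluster scalar intensities into two fuzzy clusters; returns the centers."""
    # Assumed initialization: start from the darkest and brightest intensity.
    centers = [float(min(values)), float(max(values))]
    for _ in range(n_iter):
        all_u = [memberships(v, centers, m) for v in values]
        new_centers = []
        for k in range(2):
            weights = [u[k] ** m for u in all_u]
            total = sum(w * v for w, v in zip(weights, values))
            new_centers.append(total / sum(weights))
        centers = new_centers
    return centers


def binarize(gray, cutoff=0.5):
    """Mark a pixel as ink (1) when it belongs mostly to the darker cluster."""
    flat = [p for row in gray for p in row]
    centers = fuzzy_c_means_two_level(flat)
    dark = centers.index(min(centers))
    return [[1 if memberships(p, centers)[dark] > cutoff else 0 for p in row]
            for row in gray]


def horizontal_histogram(binary):
    """Number of ink pixels in each row."""
    return [sum(row) for row in binary]


def find_text_lines(binary, min_ink=1):
    """Return (top_row, bottom_row) for each run of rows containing ink."""
    lines, start = [], None
    profile = horizontal_histogram(binary)
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                      # a text line begins
        elif count < min_ink and start is not None:
            lines.append((start, y - 1))   # a text line ends at the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Toy grayscale page: 240 = paper, 20 = ink.
B, K = 240, 20
image = [
    [B, B, B, B, B, B],
    [B, K, K, K, K, B],
    [B, K, B, K, B, B],
    [B, B, B, B, B, B],
    [B, B, B, B, B, B],
    [B, K, K, B, K, B],
    [B, B, B, B, B, B],
]
ink = binarize(image)
print(horizontal_histogram(ink))  # [0, 4, 2, 0, 0, 3, 0]
print(find_text_lines(ink))       # [(1, 2), (5, 5)]
```

On the toy page, rows 1 to 2 and row 5 come out as two separate text lines because the empty rows between them produce zeros in the histogram. On real scans, a `min_ink` threshold above 1 would likely be needed to tolerate noise.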
