Extraction of line-word-character segments directly from run-length compressed printed text-documents

2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG). Pub Date: 2013-12-01. DOI: 10.1109/NCVPRIPG.2013.6776195

M. Javed, P. Nagabhushan, B. B. Chaudhuri


Citations: 30

Abstract

Segmentation of a text document into lines, words and characters, a crucial preprocessing stage in Optical Character Recognition (OCR), is traditionally carried out on uncompressed documents, even though most real-world documents are stored in compressed form for transmission and storage efficiency. Consequently, the compressed image must first be decompressed, which demands additional computing resources. This limitation motivated us to investigate document image analysis directly on compressed documents. In this paper, we present a new approach to line-, word- and character-level segmentation of run-length compressed printed text documents. We extract the horizontal projection profile curve from the compressed file and perform line segmentation using its local minima. However, tracing the vertical information needed to locate words and characters in a run-length compressed file is not straightforward. We therefore propose a novel technique for simultaneous word and character segmentation that pops out column runs from each row in an intelligent sequence. The proposed algorithms have been validated with 1101 text lines, 1409 words and 7582 characters from a data set of 35 noise-free and skew-free compressed documents in Bengali, Kannada and English scripts.
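As a rough illustration of the ideas in the abstract, the sketch below computes a horizontal projection profile directly from run-length data and splits it into text lines at empty valleys. Several things here are assumptions, not details from the paper: each row is taken to be a list of alternating run lengths that starts with a background run (possibly of length 0), the function names are hypothetical, and the local-minima search is reduced to a simple threshold, which is reasonable only for noise-free, skew-free pages. The column step merges foreground intervals across the rows of one text line to find blank column gaps; it is a simplified stand-in and not the authors' run-popping technique.

```python
from typing import List, Tuple

# Assumed encoding: alternating run lengths per row, starting with a
# background (white) run, e.g. [1, 2, 5] = 1 white, 2 black, 5 white.
Row = List[int]


def horizontal_profile(rows: List[Row]) -> List[int]:
    """Foreground pixel count per row: the sum of the runs at odd positions."""
    return [sum(row[1::2]) for row in rows]


def segment_lines(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return inclusive (first_row, last_row) bands where profile > threshold."""
    lines = []
    start = None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # entering a text line
        elif count <= threshold and start is not None:
            lines.append((start, y - 1))   # leaving a text line
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


def foreground_intervals(row: Row) -> List[Tuple[int, int]]:
    """Half-open column intervals [start, end) covered by foreground runs."""
    spans, x = [], 0
    for i, run in enumerate(row):
        if i % 2 == 1 and run > 0:
            spans.append((x, x + run))
        x += run
    return spans


def segment_columns(rows: List[Row]) -> List[Tuple[int, int]]:
    """Merge foreground intervals of all rows in one text line.

    The gaps between the merged intervals are columns with no ink;
    wide gaps would suggest word breaks, narrow ones character breaks.
    """
    spans = sorted(s for row in rows for s in foreground_intervals(row))
    merged: List[Tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy 8-pixel-wide page with two text lines separated by a blank row.
page = [[8], [1, 2, 2, 2, 1], [1, 3, 4], [8], [2, 4, 2], [8]]
profile = horizontal_profile(page)
lines = segment_lines(profile)
first_line_segments = segment_columns(page[lines[0][0]:lines[0][1] + 1])
```

Neither step reconstructs the pixel grid: the profile is a sum over run lengths, and the column segments come from interval arithmetic on run positions, which is the kind of saving that working in the compressed domain is meant to provide.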

