Extraction of line-word-character segments directly from run-length compressed printed text-documents

M. Javed, P. Nagabhushan, B. B. Chaudhuri
Published in: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG)
Publication date: 2013-12-01
DOI: 10.1109/NCVPRIPG.2013.6776195 (https://doi.org/10.1109/NCVPRIPG.2013.6776195)
Citations: 30

Abstract

Segmentation of a text document into lines, words and characters, considered a crucial preprocessing stage in Optical Character Recognition (OCR), is traditionally carried out on uncompressed documents, even though most real-world documents are available in compressed form for reasons of transmission and storage efficiency. This means the compressed image must first be decompressed, which demands additional computing resources. This limitation has motivated us to take up research in document image analysis that operates directly on compressed documents. In this paper, we propose a new approach to segmentation at the line, word and character level in run-length compressed printed text documents. We extract the horizontal projection profile curve from the compressed file and perform line segmentation using its local minima. However, tracing the vertical information needed to track words and characters in a run-length compressed file is not straightforward. Therefore, we propose a novel technique for simultaneous word and character segmentation that pops out column runs from each row in an intelligent sequence. The proposed algorithms have been validated with 1101 text lines, 1409 words and 7582 characters from a dataset of 35 noise-free and skew-free compressed documents in Bengali, Kannada and English scripts.
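The line-segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each image row is stored as a list of alternating run lengths that starts with a background (white) run, so foreground runs sit at the odd positions; and, because the documents are noise-free and skew-free, the local minima between text lines are simplified to rows whose profile value falls to a threshold of zero. The function names and the sample data are hypothetical.

```python
from typing import List, Tuple


def horizontal_projection_profile(rle_rows: List[List[int]]) -> List[int]:
    """Count foreground pixels per row without decompressing the image.

    Assumption: every row alternates background/foreground run lengths
    and begins with a background run (which may have length 0), so the
    foreground runs are the elements at indices 1, 3, 5, ...
    """
    return [sum(row[1::2]) for row in rle_rows]


def segment_lines(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each text line.

    Simplification: a row belongs to a text line when its profile value
    exceeds `threshold`; rows at or below it are treated as the valleys
    separating consecutive lines.
    """
    lines: List[Tuple[int, int]] = []
    start = None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y                      # a text line begins
        elif value <= threshold and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # text line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Hypothetical 8-pixel-wide image with two text lines.
rows = [[8], [2, 3, 3], [1, 5, 2], [8], [8], [0, 4, 4], [3, 2, 3], [8]]
profile = horizontal_projection_profile(rows)
print(profile)                 # [0, 3, 5, 0, 0, 4, 2, 0]
print(segment_lines(profile))  # [(1, 2), (5, 6)]
```

The profile is obtained from the run lengths alone, which is why line segmentation is the easy case in the compressed domain. The column-wise information needed for word and character segmentation is spread across the runs of every row, which is the difficulty the abstract's run-popping technique addresses; that technique is not sketched here because the abstract does not specify it.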