A fast and efficient method for document segmentation for OCR

B. Kruatrachue, P. Suthaphan
Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology. TENCON 2001 (Cat. No.01CH37239)
Publication date: 2001-08-19
DOI: 10.1109/TENCON.2001.949618 (https://doi.org/10.1109/TENCON.2001.949618)
Citations: 12

Abstract

This paper describes a fast and efficient method for page segmentation of documents that contain nonrectangular blocks. The method takes a mixed top-down and bottom-up approach to document analysis. Segmentation is based on column blocks (paragraphs) extracted by a modified edge-following algorithm. Instead of a single pixel, the algorithm uses a window of 32 by 32 pixels, so that it extracts a paragraph rather than a character. The document is scanned at 300 dpi, and a single block may capture more than one column. Characters in the block are then extracted using the edge-following algorithm, and their boundaries are used to detect multicolumn cases (bottom-up). Because block extraction scans only the border pixels of paragraphs, and characters must be extracted in the OCR process anyway, the algorithm is faster and has lower overhead than algorithms that need to access every pixel of the document.
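The window idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the details of the modified edge-following algorithm, so the sketch swaps in a plain flood fill over the window grid to group occupied windows into blocks. The function names `window_grid` and `extract_blocks` and the list-of-rows image format are assumptions made for illustration. The sketch also reads every pixel to build the grid, so it shows only the coarse-window grouping, not the speed advantage the paper claims from visiting border pixels alone.

```python
from collections import deque

WINDOW = 32  # window size in pixels, as stated in the abstract


def window_grid(image, window=WINDOW):
    """Reduce a binary image (rows of 0/1, 1 = ink) to a coarse grid.

    A grid cell is True when its window contains at least one ink pixel.
    """
    height, width = len(image), len(image[0])
    rows = (height + window - 1) // window
    cols = (width + window - 1) // window
    grid = [[False] * cols for _ in range(rows)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                grid[y // window][x // window] = True
    return grid


def extract_blocks(grid):
    """Group 8-connected occupied windows into blocks.

    Flood fill is a stand-in here, not the paper's edge-following step.
    Each block is a set of (row, col) window coordinates, so a
    nonrectangular block keeps its shape.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            block = set()
            while queue:
                cr, cc = queue.popleft()
                block.add((cr, cc))
                # Visit the eight neighbouring windows.
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and grid[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            blocks.append(block)
    return blocks


if __name__ == "__main__":
    # Toy 128 x 64 page: two ink pixels in adjacent windows, one far away.
    page = [[0] * 128 for _ in range(64)]
    page[5][5] = 1
    page[10][40] = 1
    page[40][100] = 1
    for block in extract_blocks(window_grid(page)):
        print(sorted(block))
```

At 300 dpi a 32-pixel window spans roughly 2.7 mm, so ink separated by less than a window tends to merge into one block. This is consistent with the abstract's remark that more than one column can end up in a single block, which the bottom-up character pass then has to detect.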