A fast and efficient method for document segmentation for OCR

Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology. TENCON 2001 (Cat. No.01CH37239). Pub Date: 2001-08-19. DOI: 10.1109/TENCON.2001.949618

B. Kruatrachue, P. Suthaphan

Citations: 12

Abstract

This paper describes a fast and efficient method for page segmentation of documents that contain nonrectangular blocks. The method combines top-down and bottom-up document analysis. Segmentation starts from column blocks (paragraphs) extracted by a modified edge-following algorithm: instead of a single pixel, the algorithm works on windows of 32 by 32 pixels, so that it extracts a paragraph rather than a character. Documents are scanned at 300 dpi, and a single block may end up containing more than one column. The characters in each block are then extracted with the edge-following algorithm, and their boundaries are used to detect multicolumn cases (bottom-up). Because block extraction scans only the border pixels of paragraphs, and characters have to be extracted for the OCR process anyway, the algorithm is faster and has less overhead than algorithms that access every pixel of a document.
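
The window-based block extraction can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the modified edge-following algorithm, so plain Moore-neighbour boundary tracing is used here as a stand-in, and the names `window_grid`, `trace_block` and `block_bbox` are hypothetical. For simplicity the sketch also builds the full window grid up front, which reads every pixel and therefore does not reproduce the paper's speed advantage. Only the 32-pixel window size comes from the abstract.

```python
import numpy as np

WINDOW = 32  # window side in pixels, taken from the abstract

# The 8 neighbours of a cell as (row, col) offsets, clockwise starting at west.
NEIGHBOURS = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1)]


def window_grid(page, window=WINDOW):
    """Reduce a binary page (nonzero = ink) to a grid with one cell per window.

    A cell is True if its window contains at least one ink pixel.
    """
    h, w = page.shape
    rows, cols = -(-h // window), -(-w // window)  # ceiling division
    padded = np.zeros((rows * window, cols * window), dtype=bool)
    padded[:h, :w] = page != 0
    return padded.reshape(rows, window, cols, window).any(axis=(1, 3))


def trace_block(grid):
    """Trace the outer boundary of the first block met in raster order.

    Returns the visited boundary cells as (row, col) tuples, in tracing order.
    """
    cells = np.argwhere(grid)
    if len(cells) == 0:
        return []
    start = (int(cells[0][0]), int(cells[0][1]))

    def filled(r, c):
        return 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1] and bool(grid[r, c])

    boundary = []
    # State: (current cell, direction of a neighbour known to be empty).
    # The start cell is first in raster order, so its west neighbour is empty.
    state = (start, 0)
    seen = set()
    while state not in seen:  # stop once the tracing state repeats
        seen.add(state)
        (r, c), back = state
        boundary.append((r, c))
        for k in range(1, 8):  # sweep clockwise from the empty neighbour
            d = (back + k) % 8
            nr, nc = r + NEIGHBOURS[d][0], c + NEIGHBOURS[d][1]
            if filled(nr, nc):
                # The neighbour examined just before (nr, nc) was empty.
                pr, pc = r + NEIGHBOURS[d - 1][0], c + NEIGHBOURS[d - 1][1]
                state = ((nr, nc), NEIGHBOURS.index((pr - nr, pc - nc)))
                break
        else:
            break  # isolated cell: no filled neighbour
    return boundary


def block_bbox(boundary, page_shape, window=WINDOW):
    """Bounding box (top, left, bottom, right) of the traced block, in pixels."""
    rows = [r for r, _ in boundary]
    cols = [c for _, c in boundary]
    return (min(rows) * window, min(cols) * window,
            min((max(rows) + 1) * window, page_shape[0]),
            min((max(cols) + 1) * window, page_shape[1]))


# Synthetic 320 x 320 page with one L-shaped (nonrectangular) block.
page = np.zeros((320, 320), dtype=np.uint8)
page[40:200, 40:280] = 1   # wide upper part
page[200:300, 40:120] = 1  # narrow lower part
grid = window_grid(page)
boundary = trace_block(grid)
print(block_bbox(boundary, page.shape))
```

The tracer only ever visits cells that touch the outside of the block, which is the property the abstract relies on. A bottom-up pass over the character boundaries inside each block would then be needed to split blocks that contain several columns.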
