Robust unsupervised segmentation of degraded document images with topic models

2009 IEEE Conference on Computer Vision and Pattern Recognition
Pub Date: 2009-06-20
DOI: 10.1109/CVPR.2009.5206606

Timothy J. Burns, Jason J. Corso


Citations: 15

Abstract

Segmentation of document images remains a challenging vision problem. Although document images have a structured layout, capturing enough of it for segmentation can be difficult. Most current methods combine text extraction and heuristics for segmentation, but text extraction is prone to failure and measuring accuracy remains difficult. Furthermore, when presented with significant degradation, many common heuristic methods fall apart. In this paper, we propose a Bayesian generative model for document images that seeks to overcome some of these drawbacks. Our model automatically discovers the different regions present in a document image in a completely unsupervised fashion. We attempt no text extraction, but rather use discrete patch-based codebook learning to make our probabilistic representation feasible. Each latent region topic is a distribution over these patch indices. We capture rough document layout with an MRF Potts model. We take an analysis-by-synthesis approach to examine the model, and provide quantitative segmentation results on a manually labeled document image data set. We illustrate our model's robustness by providing results on a highly degraded version of our test set.
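The abstract names two ingredients that a small sketch may make more concrete: mapping image patches to discrete codebook indices, and a Potts prior that favors neighboring patches sharing a region label. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names (`extract_patches`, `quantize`, `potts_energy`), the non-overlapping square patches, the squared-L2 nearest-entry rule, the 4-neighborhood, and the `beta` weight are all hypothetical choices that the abstract does not specify. Learning the codebook itself (for example by clustering patches) and inferring the topics are left out.

```python
import numpy as np

def extract_patches(image, size):
    """Cut a 2-D grayscale image into non-overlapping size x size patches.

    Returns an array of shape (rows, cols, size * size). Patch shape and
    the lack of overlap are assumptions made for this illustration.
    """
    rows, cols = image.shape[0] // size, image.shape[1] // size
    trimmed = image[:rows * size, :cols * size]
    patches = trimmed.reshape(rows, size, cols, size).swapaxes(1, 2)
    return patches.reshape(rows, cols, size * size)

def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook entry.

    patches: (rows, cols, d); codebook: (k, d). Squared L2 distance is
    an assumed choice; the abstract only says the patches are discretized.
    """
    diffs = patches[:, :, None, :] - codebook[None, None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=-1)

def potts_energy(labels, beta=1.0):
    """Potts smoothness cost over a 4-neighborhood.

    Adds beta for every vertically or horizontally adjacent pair of
    patches whose region labels differ.
    """
    vertical = (labels[1:, :] != labels[:-1, :]).sum()
    horizontal = (labels[:, 1:] != labels[:, :-1]).sum()
    return beta * float(vertical + horizontal)

# Toy example: a 4x4 image whose left half is dark and right half is bright.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
codebook = np.array([[0.0] * 4, [1.0] * 4])  # hypothetical 2-entry codebook
indices = quantize(extract_patches(image, 2), codebook)
```

In this toy case `indices` is a 2x2 grid of codebook indices. Under the model described in the abstract, each latent region topic would be a distribution over such indices, and a cost like `potts_energy` would penalize label maps in which adjacent patches are assigned to different regions.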

