Iterated Document Content Classification

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
Publication date: 2007-09-23
DOI: 10.1109/ICDAR.2007.148

Chang An, H. Baird, Pingping Xiu


Citations: 25

Abstract

We report an improved methodology for training classifiers for document image content extraction, that is, the location and segmentation of regions containing handwriting, machine-printed text, photographs, blank space, etc. Our previous methods classified each individual pixel separately (rather than regions): this avoids the arbitrariness and restrictiveness that result from constraining region shapes (to, e.g., rectangles). However, this policy also allows content classes to vary frequently within small regions, often yielding areas where several content classes are mixed together. This does not reflect the way that real content is organized: typically almost all small local regions are of uniform class. This observation suggested a post-classification methodology which enforces local uniformity without imposing a restricted class of region shapes. We extract features from small local regions (e.g., a radius of 4-5 pixels) and use them to train classifiers that operate on the output of previous classifiers, guided by ground truth. This yields a sequence of post-classifiers, each trained separately on the results of the previous classifier. Experiments on a highly diverse test set of 83 document images show that this method reduces per-pixel classification errors by 23%, and it dramatically increases the occurrence of large contiguous regions of uniform class, thus providing highly usable near-solid 'masks' with which to segment the images into distinct classes. It continues to allow a wide range of complex, non-rectilinear region shapes.
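The iterated post-classification idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which classifier or which exact features were used, so the code below assumes a nearest-centroid classifier as a stand-in and uses per-class label fractions in a square window (clipped at image borders) as the local features. All function names are hypothetical, and for brevity the sketch trains and applies the stages on the same synthetic image, whereas the paper reports results on a separate test set.

```python
import random

def local_histogram(labels, row, col, n_classes, radius):
    """Fraction of each class label in the square window around (row, col)."""
    h, w = len(labels), len(labels[0])
    counts = [0] * n_classes
    for r in range(max(0, row - radius), min(h, row + radius + 1)):
        for c in range(max(0, col - radius), min(w, col + radius + 1)):
            counts[labels[r][c]] += 1
    total = sum(counts)
    return [n / total for n in counts]

def extract_features(labels, n_classes, radius):
    """One local-histogram feature vector per pixel of the label map."""
    return [[local_histogram(labels, r, c, n_classes, radius)
             for c in range(len(labels[0]))]
            for r in range(len(labels))]

def train_stage(features, truth, n_classes):
    """Assumed stand-in classifier: mean feature vector per ground-truth class."""
    sums = [[0.0] * n_classes for _ in range(n_classes)]
    counts = [0] * n_classes
    for feat_row, truth_row in zip(features, truth):
        for feat, t in zip(feat_row, truth_row):
            counts[t] += 1
            for k, value in enumerate(feat):
                sums[t][k] += value
    # Classes absent from the ground truth get no centroid.
    return {c: [s / counts[c] for s in sums[c]]
            for c in range(n_classes) if counts[c]}

def apply_stage(features, centroids):
    """Relabel each pixel with the class of the nearest centroid."""
    def nearest(feat):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(feat, centroids[c])))
    return [[nearest(feat) for feat in row] for row in features]

def train_iterated(initial, truth, n_classes, radius=4, n_stages=3):
    """Train a sequence of post-classifiers, each on the previous stage's output."""
    stages, labels = [], initial
    for _ in range(n_stages):
        feats = extract_features(labels, n_classes, radius)
        centroids = train_stage(feats, truth, n_classes)  # guided by ground truth
        labels = apply_stage(feats, centroids)
        stages.append(centroids)
    return stages, labels

def error_rate(labels, truth):
    """Fraction of pixels whose label differs from the ground truth."""
    wrong = sum(a != b for lr, tr in zip(labels, truth) for a, b in zip(lr, tr))
    return wrong / (len(truth) * len(truth[0]))

# Synthetic demo (illustrative data only): three vertical bands of uniform
# class, with about 30% of pixels replaced by a random label to mimic a noisy
# per-pixel classifier.
rng = random.Random(0)
truth = [[c // 10 for c in range(30)] for _ in range(30)]
noisy = [[rng.randrange(3) if rng.random() < 0.3 else t for t in row]
         for row in truth]
stages, refined = train_iterated(noisy, truth, n_classes=3)
print("error before:", error_rate(noisy, truth))
print("error after: ", error_rate(refined, truth))
```

Because each stage only relabels individual pixels from local evidence, no region shape is imposed; the default radius of 4 follows the "4-5 pixels" figure given in the abstract, while the number of stages is an arbitrary choice for the demo.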

