异构文档集合中的同步布局样式和逻辑实体识别

Siyuan Chen, Song Mao, G. Thoma
{"title":"异构文档集合中的同步布局样式和逻辑实体识别","authors":"Siyuan Chen, Song Mao, G. Thoma","doi":"10.1109/ICDAR.2007.231","DOIUrl":null,"url":null,"abstract":"Logical entity recognition in heterogeneous collections of document page images remains a challenging problem since the performance of traditional supervised methods degrades dramatically in case of many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both training and recognition phases. We represent the layout style, local features, and logical labels of physical regions of a document compactly by an ordered labeled X-Y tree. Style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in closest- matched layout style cluster of training set. Experimental results show that our algorithm is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.","PeriodicalId":279268,"journal":{"name":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents\",\"authors\":\"Siyuan Chen, Song Mao, G. Thoma\",\"doi\":\"10.1109/ICDAR.2007.231\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Logical entity recognition in heterogeneous collections of document page images remains a challenging problem since the performance of traditional supervised methods degrades dramatically in case of many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both training and recognition phases. We represent the layout style, local features, and logical labels of physical regions of a document compactly by an ordered labeled X-Y tree. Style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in closest- matched layout style cluster of training set. Experimental results show that our algorithm is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.\",\"PeriodicalId\":279268,\"journal\":{\"name\":\"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)\",\"volume\":\"130 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDAR.2007.231\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2007.231","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

摘要

在异构文档页面图像集合中的逻辑实体识别仍然是一个具有挑战性的问题,因为传统的监督方法在许多不同布局样式的情况下性能显著下降。在本文中,我们提出了一种无监督的方法,其中布局样式信息在训练和识别阶段都被明确地使用。我们通过有序的标记X-Y树紧凑地表示文档的物理区域的布局样式、局部特征和逻辑标签。两个文档页面的样式差异由它们各自树之间的距离表示。在训练阶段,通过无监督聚类将训练集中具有真实逻辑标签的文档页面划分为不同的布局样式。在识别阶段,通过将输入树与训练集最匹配的布局样式簇中的树进行匹配,同时识别输入文档的布局样式和逻辑实体。实验结果表明,该算法对平衡和不平衡样式簇大小、区域过度分割、区域长度变化和相同布局样式的树表示变化都具有鲁棒性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Simultaneous Layout Style and Logical Entity Recognition in a Heterogeneous Collection of Documents
Logical entity recognition in heterogeneous collections of document page images remains a challenging problem since the performance of traditional supervised methods degrades dramatically in case of many distinct layout styles. In this paper we present an unsupervised method where layout style information is explicitly used in both training and recognition phases. We represent the layout style, local features, and logical labels of physical regions of a document compactly by an ordered labeled X-Y tree. Style dissimilarity of two document pages is represented by the distance between their respective trees. During the training phase, document pages with true logical labels in training set are classified into distinct layout styles by unsupervised clustering. During the recognition phase, the layout style and logical entities of an input document are recognized simultaneously by matching the input tree to the trees in closest- matched layout style cluster of training set. Experimental results show that our algorithm is robust with both balanced and unbalanced style cluster sizes, zone over-segmentation, zone length variation, and variation in tree representations of the same layout style.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信