Column segmentation by white space pattern matching
M. Ozaki
Proceedings of 3rd International Conference on Document Analysis and Recognition, 1995-08-14
DOI: 10.1109/ICDAR.1995.598960 (https://doi.org/10.1109/ICDAR.1995.598960)
Citations: 4
Abstract
Model-based column segmentation is described. Sequences of horizontal white space across a column are used as the basic features. Structures of columns in a specific publication are described by two levels of regular expressions: column expressions (CE) and element expressions (EE). Additional spatial constraints for element attributes can be described. A CE represents patterns of element sequences. An EE represents patterns of white space sequences for each element type. Segmentation is performed in three steps: element candidate extraction using EEs, column structure verification using the CE and ranking by comparison with statistical data. Experiments were performed on columns in two different scientific journals. More than 70% of the columns were correctly segmented as the top choice and more than 87% were in the top three choices. When spatial constraints were applied to element attributes, the rate was more than 90%.
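The two-level scheme and the three segmentation steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how white space is encoded or what the expressions look like, so the gap symbols (`s`, `m`, `l`), the thresholds, the element types (`T` title, `H` heading, `P` paragraph), the expressions themselves, and the `TYPICAL_GAPS` statistics are all hypothetical choices made for illustration.

```python
import re

# ASSUMPTION: each horizontal white-space gap is quantized by height into
# 's' (line spacing), 'm' (medium gap) or 'l' (large gap). Thresholds are made up.
def encode_gaps(gap_heights, small=4, large=12):
    return "".join(
        "s" if h <= small else "l" if h >= large else "m" for h in gap_heights
    )

# ASSUMPTION: hypothetical element expressions (EE), one regex over gap
# symbols per element type.
ELEMENT_EXPRESSIONS = {
    "T": re.compile(r"s{0,2}l"),  # title: a few lines, closed by a large gap
    "H": re.compile(r"m"),        # heading: one line, closed by a medium gap
    "P": re.compile(r"s*m"),      # paragraph: any lines, closed by a medium gap
}

# ASSUMPTION: hypothetical column expression (CE) over element labels:
# optional title, then one or more paragraphs, each optionally headed.
COLUMN_EXPRESSION = re.compile(r"T?(?:H?P)+")

# ASSUMPTION: made-up "statistical data": typical gap count per element type.
TYPICAL_GAPS = {"T": 2, "H": 1, "P": 5}

def segmentations(symbols, start=0):
    """Step 1: yield every split of symbols[start:] into EE-matching elements."""
    if start == len(symbols):
        yield []
        return
    for label, ee in ELEMENT_EXPRESSIONS.items():
        for end in range(start + 1, len(symbols) + 1):
            if ee.fullmatch(symbols, start, end):
                for rest in segmentations(symbols, end):
                    yield [(label, start, end)] + rest

def rank(symbols):
    """Steps 2 and 3: keep splits accepted by the CE, then sort by cost."""
    candidates = []
    for seg in segmentations(symbols):
        labels = "".join(label for label, _, _ in seg)
        if COLUMN_EXPRESSION.fullmatch(labels):
            # Cost: total deviation of element sizes from the typical sizes.
            cost = sum(abs((e - b) - TYPICAL_GAPS[label]) for label, b, e in seg)
            candidates.append((cost, labels, seg))
    return sorted(candidates, key=lambda c: c[0])

if __name__ == "__main__":
    symbols = encode_gaps([3, 14, 10, 3, 3, 3, 8, 3, 8])
    for cost, labels, _ in rank(symbols):
        print(cost, labels)
```

In this toy column the single medium gap after the title can be read either as a heading or as a one-line paragraph, so two candidates survive verification and the ranking step decides between them. Exhaustive enumeration is used only for brevity; it grows quickly with column length.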