Classification of Text regions in a Document Image by Analyzing the properties of Connected Components
Authors: Showmik Bhowmik, R. Sarkar
Published in: 2020 IEEE Applied Signal Processing Conference (ASPCON), 2020-10-07
DOI: 10.1109/ASPCON49795.2020.9276688 (https://doi.org/10.1109/ASPCON49795.2020.9276688)
Citations: 2
Abstract
Document layout analysis is a necessary step in building an effective and complete document image processing system. In this step, an input document image is segmented into regions, which are then classified as text or non-text. Non-text regions are further classified into sub-classes such as table, image, separator, graphic, and chart, whereas text regions are classified as title, paragraph, header, footer, caption, drop-capital, and so on. This paper presents a method based on connected component analysis that classifies a given text region as title, heading, paragraph, drop-capital, header, or footer. The classification uses the position and size of a region together with its alignment. To demonstrate the effectiveness of the method, the output of the segmentation system BINYAS is shown both with and without the proposed text region classification module for sample document images taken from the RDCL 2017 dataset.
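The abstract names the cues the method relies on (position, size, and alignment of a region) but not the rules themselves. The following Python sketch only illustrates how such cues could be combined into a rule-based labeller. The `Region` fields, the `classify_region` function, and every threshold are hypothetical choices made for this example; they are not taken from the paper or from BINYAS.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Bounding box of a text region plus simple connected-component statistics."""
    x: int                   # left edge, in pixels
    y: int                   # top edge, in pixels
    w: int                   # width, in pixels
    h: int                   # height, in pixels
    median_cc_height: float  # median height of the region's connected components
    n_components: int        # number of connected components in the region


def classify_region(r: Region, page_w: int, page_h: int, body_cc_height: float) -> str:
    """Label a text region using position, size and alignment cues.

    All thresholds below are illustrative assumptions, not values from the paper.
    """
    top = r.y / page_h                       # relative position of the top edge
    bottom = (r.y + r.h) / page_h            # relative position of the bottom edge
    size_ratio = r.median_cc_height / body_cc_height
    centre_offset = abs((r.x + r.w / 2) - page_w / 2) / page_w

    # Position: small text in a thin band at the top or bottom of the page.
    if bottom <= 0.06 and size_ratio <= 1.2:
        return "header"
    if top >= 0.94 and size_ratio <= 1.2:
        return "footer"

    # Size: a single component much taller than body text.
    if r.n_components == 1 and size_ratio >= 2.5:
        return "drop-capital"

    # Size plus alignment: large, horizontally centred text.
    if size_ratio >= 1.8 and centre_offset <= 0.1:
        return "title"

    # Size only: moderately larger than body text.
    if size_ratio >= 1.3:
        return "heading"

    return "paragraph"
```

Here `body_cc_height` stands for an estimate of the typical component height of body text on the page (for instance, the median over all text components), which makes the size cue relative rather than absolute. A real system would need to tune or learn such rules on annotated pages.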