P. Singh, Sajal Mahanta, Samir Malakar, R. Sarkar, M. Nasipuri
Title: Development of a page segmentation technique for Bangla documents printed in italic style
Published in: 2014 2nd International Conference on Business and Information Management (ICBIM), pp. 120-125, 2014
DOI: https://doi.org/10.1109/ICBIM.2014.6970950
Citations: 8
Abstract
Optical Character Recognition (OCR) is an essential component of electronic document analysis systems. Segmentation, the preliminary step of OCR, has long been an active area of research. In this paper, we present a hierarchical system for segmenting Bangla script documents printed in two styles, italic and bold italic, with varying fonts and sizes. First, text lines are segmented from the document pages. Next, words are segmented from the extracted text lines. Finally, characters are segmented from the extracted word images using a trapezoidal fuzzy membership function, which is used to detect the Matra (headline) region. The proposed technique is tested on 16 document pages containing 1456 words. The average success rates for text line, word and character segmentation are 99.91%, 98.63% and 89.41%, respectively.
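The abstract does not say how the trapezoidal membership function is applied or how lines are separated, so the following Python is only a minimal sketch under stated assumptions, not the authors' method. It assumes a binary image stored as a list of rows (1 = ink), uses a horizontal projection profile to find text lines (a common technique that the abstract does not confirm), and scores each row of a word image by its ink fraction to flag candidate Matra rows. The names `trapezoid`, `split_runs`, `matra_rows` and all parameter values are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership with a <= b <= c <= d.

    Returns 1 on [b, c], rises linearly on (a, b), falls linearly
    on (c, d), and is 0 elsewhere.
    """
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def split_runs(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, of runs above threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of inked rows begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ends at a blank row
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def matra_rows(word, a=0.5, b=0.8, cutoff=0.5):
    """Assumed heuristic: rows whose ink fraction has high membership.

    The breakpoints a and b are illustrative guesses, not values
    taken from the paper.
    """
    width = len(word[0])
    rows = []
    for i, row in enumerate(word):
        ink_fraction = sum(row) / width
        if trapezoid(ink_fraction, a, b, 1.0, 1.0) >= cutoff:
            rows.append(i)
    return rows


# Toy word image: row 1 is a full-width stroke, like a headline.
word = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
]
line_spans = split_runs([sum(row) for row in word])
headline = matra_rows(word)
```

In this toy image the three inked rows form a single line span, and only the full-width row is flagged as a Matra candidate. Italic text slants the strokes, which is presumably why the paper needs a soft (fuzzy) criterion rather than a hard threshold; that motivation is an inference, not something the abstract states.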