{"title":"A Framework for Word Segmentation in Images using Density-based Clustering","authors":"Hui Guo, Qin Ding","doi":"10.29007/hq3n","DOIUrl":null,"url":null,"abstract":"Word recognition is to identify words in images of printed or handwritten documents. It is especially challenging to recognize words from cursive handwriting documents. In this paper, we present a framework of using density-based clustering for word segmentation in printed or handwritten documents, including cursive handwriting. First, we performed various strategies for data preprocessing, including converting images to B/W images, adjusting the tilted images, and removing the background noises. K-means clustering and/or neighborhood density are used in finding parameters for the preprocessing steps. The preprocessing has shown to be very effective. For the word segmentation, we proposed density-based clustering to segment the words using multiple steps, including blurring, plotting, and clustering. We also developed a system for the framework, including preprocessing and clustering functionalities. Our approach works very well for printed documents. It works reasonably well for handwriting documents if words are relatively far from each other. 
The performance on handwriting documents can be further improved by using line segmentation.","PeriodicalId":264035,"journal":{"name":"International Conference on Computers and Their Applications","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Computers and Their Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.29007/hq3n","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Word recognition is the task of identifying words in images of printed or handwritten documents, and it is especially challenging for cursive handwriting. In this paper, we present a framework that uses density-based clustering for word segmentation in printed and handwritten documents, including cursive handwriting. First, we apply several preprocessing strategies: converting images to black-and-white, correcting tilted images, and removing background noise. K-means clustering and/or neighborhood density are used to find the parameters for the preprocessing steps. The preprocessing has been shown to be very effective. For word segmentation, we propose density-based clustering that segments the words in multiple steps: blurring, plotting, and clustering. We also developed a system that implements the framework, including the preprocessing and clustering functionality. Our approach works very well for printed documents. It works reasonably well for handwritten documents if words are relatively far apart. Performance on handwritten documents can be further improved by using line segmentation.
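The blurring, plotting, and clustering steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which density-based algorithm or which blur is used, so the code below assumes plain DBSCAN and a box blur, and it reads "plotting" as turning foreground pixels into (x, y) points. The function names, the toy image, and the parameter values (`blur_radius`, `blur_threshold`, `eps`, `min_pts`) are all hypothetical.

```python
from collections import deque


def box_blur(img, radius):
    """Mean filter over a (2*radius+1)^2 window; pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    area = (2 * radius + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
            out[y][x] = total / area
    return out


def dbscan(points, eps, min_pts):
    """Plain DBSCAN (Euclidean distance). Returns one label per point, -1 for noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:   # only core points expand the cluster
                queue.extend(nb)
    return labels


def segment_words(img, blur_radius=1, blur_threshold=0.1, eps=1.5, min_pts=3):
    """Blur, collect foreground points, cluster; return sorted (xmin, ymin, xmax, ymax)."""
    blurred = box_blur(img, blur_radius)
    points = [(x, y) for y, row in enumerate(blurred)
              for x, v in enumerate(row) if v > blur_threshold]
    labels = dbscan(points, eps, min_pts)
    boxes = {}
    for (x, y), lab in zip(points, labels):
        if lab == -1:
            continue
        b = boxes.setdefault(lab, [x, y, x, y])
        b[0], b[1] = min(b[0], x), min(b[1], y)
        b[2], b[3] = max(b[2], x), max(b[3], y)
    # Boxes are measured on the blurred image, so they are padded by the blur radius.
    return sorted(tuple(b) for b in boxes.values())


# Toy binary image: two "words" made of vertical strokes separated by one-pixel gaps.
ROWS = [
    "....................",
    ".#.#.#......#.#.#.#.",
    ".#.#.#......#.#.#.#.",
    ".#.#.#......#.#.#.#.",
    "....................",
]
IMAGE = [[1 if c == "#" else 0 for c in row] for row in ROWS]

print(segment_words(IMAGE))                      # [(0, 0, 6, 4), (11, 0, 19, 4)]
print(len(segment_words(IMAGE, blur_radius=0)))  # 7
```

With the blur, the strokes inside each word merge into one dense region and the clustering yields two word boxes. With `blur_radius=0`, every stroke becomes its own cluster, which suggests why a blurring step precedes clustering. On real scans the blur radius and `eps` would need to be tuned to the gap between letters versus the gap between words, which is consistent with the abstract's remark that the method works best when words are relatively far apart.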