A Framework for Word Segmentation in Images using Density-based Clustering

International Conference on Computers and Their Applications. Pub Date: 2020-03-09. DOI: 10.29007/hq3n

Hui Guo, Qin Ding



Abstract

Word recognition is the task of identifying words in images of printed or handwritten documents. Recognizing words in cursive handwriting is especially challenging. In this paper, we present a framework that uses density-based clustering for word segmentation in printed or handwritten documents, including cursive handwriting. First, we apply several preprocessing strategies: converting images to black and white (B/W), correcting tilted images, and removing background noise. K-means clustering and/or neighborhood density are used to find parameters for the preprocessing steps. The preprocessing has proven very effective. For word segmentation, we propose density-based clustering that segments words in multiple steps: blurring, plotting, and clustering. We also developed a system that implements the framework, including preprocessing and clustering functionality. Our approach works very well for printed documents. It works reasonably well for handwritten documents when words are relatively far apart. Performance on handwritten documents can be further improved by using line segmentation.
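The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the density-based algorithm or its parameters, so the code below assumes plain DBSCAN over the coordinates of ink pixels in an already binarized image; the function names `dbscan` and `segment_words`, the values `eps=1.5` and `min_pts=3`, and the toy image are all hypothetical. Preprocessing and blurring are omitted, so this is not the authors' implementation.

```python
from collections import deque


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    eps_sq = eps * eps

    def neighbors(i):
        # Brute-force radius query; includes the point itself.
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps_sq]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points expand the cluster
                queue.extend(nbrs)
    return labels


def segment_words(image, eps=1.5, min_pts=3):
    """Return one bounding box (x0, y0, x1, y1) per cluster of ink pixels.

    `image` is a list of rows of 0/1 values, where 1 marks an ink pixel.
    The parameter defaults are illustrative assumptions, not values from the paper.
    """
    points = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    labels = dbscan(points, eps, min_pts)
    boxes = {}
    for (x, y), lab in zip(points, labels):
        if lab == -1:
            continue  # drop isolated noise pixels
        x0, y0, x1, y1 = boxes.get(lab, (x, y, x, y))
        boxes[lab] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return sorted(boxes.values())


# Toy page: two "words" separated by a gap, plus one stray noise pixel.
rows = [
    "............",
    ".###...###..",
    ".#.#...#.#..",
    ".###...###..",
    "............",
    "...........#",
]
page = [[1 if ch == "#" else 0 for ch in row] for row in rows]
boxes = segment_words(page)
print(boxes)  # [(1, 1, 3, 3), (7, 1, 9, 3)]
```

In this toy case the gap between the two blobs is wider than `eps`, so they form separate clusters, and the stray pixel has too few neighbors to join either. This mirrors the abstract's observation that the approach works best when words are relatively far apart.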
