A Framework for Word Segmentation in Images using Density-based Clustering

Hui Guo, Qin Ding
International Conference on Computers and Their Applications, published 2020-03-09. DOI: 10.29007/hq3n (https://doi.org/10.29007/hq3n)

Abstract

Word recognition is the task of identifying words in images of printed or handwritten documents, and it is especially challenging for cursive handwriting. In this paper, we present a framework that uses density-based clustering for word segmentation in printed and handwritten documents, including cursive handwriting. First, we apply several preprocessing strategies: converting images to black-and-white (B/W), correcting tilted images, and removing background noise. K-means clustering and/or neighborhood density are used to find the parameters for these preprocessing steps. The preprocessing has been shown to be very effective. For word segmentation, we propose a multi-step density-based clustering approach consisting of blurring, plotting, and clustering. We also developed a system that implements the framework, including the preprocessing and clustering functionality. Our approach works very well for printed documents and reasonably well for handwritten documents when words are relatively far apart. Performance on handwritten documents can be further improved by using line segmentation.
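
To make the pipeline described above more concrete, here is a minimal Python sketch under stated assumptions. The abstract does not name the specific density-based algorithm or give any parameter values, so the sketch assumes DBSCAN, and the values of `eps` and `min_pts` are illustrative only. All function names (`two_means_threshold`, `ink_points`, `dbscan`, `word_boxes`, `segment_words`, `demo_page`) are hypothetical and are not taken from the authors' system. Tilt correction and blurring are omitted; here the `eps` radius plays the role of bridging small gaps between letters of the same word.

```python
from collections import deque


def two_means_threshold(values, iters=20):
    """1-D k-means with k=2; returns the midpoint between the two centroids."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        mid = (lo + hi) / 2
        dark = [v for v in values if v <= mid]
        light = [v for v in values if v > mid]
        if not dark or not light:
            break
        lo, hi = sum(dark) / len(dark), sum(light) / len(light)
    return (lo + hi) / 2


def ink_points(gray):
    """Binarize a grayscale grid (0 = black, 255 = white); return ink (row, col) pairs."""
    t = two_means_threshold([v for row in gray for v in row])
    return [(r, c) for r, row in enumerate(gray) for c, v in enumerate(row) if v <= t]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN with brute-force neighbour search; label -1 marks noise."""
    def neighbours(i):
        r0, c0 = points[i]
        return [j for j, (r, c) in enumerate(points)
                if (r - r0) ** 2 + (c - c0) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # only core points expand the cluster
                queue.extend(nb)
    return labels


def word_boxes(points, labels):
    """Bounding box (top, left, bottom, right) per cluster, sorted left to right."""
    boxes = {}
    for (r, c), lab in zip(points, labels):
        if lab == -1:
            continue
        top, left, bottom, right = boxes.get(lab, (r, c, r, c))
        boxes[lab] = (min(top, r), min(left, c), max(bottom, r), max(right, c))
    return sorted(boxes.values(), key=lambda b: b[1])


def segment_words(gray, eps=2.5, min_pts=4):
    """Binarize, cluster ink pixels, and return one box per detected word."""
    pts = ink_points(gray)
    return word_boxes(pts, dbscan(pts, eps, min_pts))


def demo_page():
    """Synthetic 7x30 page: two 'words' of two 'letters' each, plus one noise speck."""
    gray = [[250] * 30 for _ in range(7)]
    for r in range(2, 5):
        for c in (2, 3, 4, 6, 7, 8, 18, 19, 20, 22, 23, 24):
            gray[r][c] = 20
    gray[0][14] = 30  # isolated dark pixel, expected to be rejected as noise
    return gray


if __name__ == "__main__":
    print(segment_words(demo_page()))  # → [(2, 2, 4, 8), (2, 18, 4, 24)]
```

In this toy page the one-column gap between letters is smaller than `eps`, so letters merge into a word, while the nine-column gap between words is not bridged. This mirrors the observation in the abstract that results depend on words being relatively far from each other; on real scans the radius would have to be tuned to the stroke and spacing scale of the document.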