Support Vector Machine (SVM) based classifier for Khmer Printed Character-set Recognition

Pongsametrey Sok, Nguonly Taing
{"title":"Support Vector Machine (SVM) based classifier for Khmer Printed Character-set Recognition","authors":"Pongsametrey Sok, Nguonly Taing","doi":"10.1109/APSIPA.2014.7041823","DOIUrl":null,"url":null,"abstract":"This paper describes on the use of Support Vector Machine (SVM) based classification method on Khmer Printed Character-set Recognition (PCR) in bitmap document. Khmer language has been identified as one of the most complex language with the total of 74 alphabets and the wording compound can has up to 5 vertical levels. This paper proposes one new method, SVM for Khmer character classification system by using 3 different SVM kernels (Gaussian, Polynomial and Linear Kernel) on data training and recognition to find out the best kernel for Khmer language. The method allows us to use small training dataset by training different pieces of character training instead of training big amount of clusters. The classification uses binary data of 0 as white space and 1 as black pixel area of the character; each training piece of character has been stretched into a matrix of the binary data in all kinds of image size. Feature extraction is extracted from the matrix to use in SVM classification. After recognition, there are some rules to combine each cluster or character by using character levels or common mistake correction. The experiment of about pure 750 Khmer words or around 3000 characters show that SVM method with Gaussian Kernel produces a good result with better performance among all kernels. The system uses one font \"Khmer OS Content\" of the training data with font size 32pt to recognize 3 different font sizes. The accuracy of 28pt font size is 98.17%, 32pt is 98.62% and 36pt is 98.54% respectively.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"37 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSIPA.2014.7041823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

This paper describes on the use of Support Vector Machine (SVM) based classification method on Khmer Printed Character-set Recognition (PCR) in bitmap document. Khmer language has been identified as one of the most complex language with the total of 74 alphabets and the wording compound can has up to 5 vertical levels. This paper proposes one new method, SVM for Khmer character classification system by using 3 different SVM kernels (Gaussian, Polynomial and Linear Kernel) on data training and recognition to find out the best kernel for Khmer language. The method allows us to use small training dataset by training different pieces of character training instead of training big amount of clusters. The classification uses binary data of 0 as white space and 1 as black pixel area of the character; each training piece of character has been stretched into a matrix of the binary data in all kinds of image size. Feature extraction is extracted from the matrix to use in SVM classification. After recognition, there are some rules to combine each cluster or character by using character levels or common mistake correction. The experiment of about pure 750 Khmer words or around 3000 characters show that SVM method with Gaussian Kernel produces a good result with better performance among all kernels. The system uses one font "Khmer OS Content" of the training data with font size 32pt to recognize 3 different font sizes. The accuracy of 28pt font size is 98.17%, 32pt is 98.62% and 36pt is 98.54% respectively.
基于支持向量机的高棉印刷字符集识别分类器
本文介绍了基于支持向量机(SVM)的分类方法在位图文档高棉印刷字符集识别(PCR)中的应用。高棉语被认为是最复杂的语言之一,共有74个字母,措辞复合可以有多达5个垂直层次。本文提出了一种新的方法——支持向量机(SVM)用于高棉语字符分类系统,通过使用3种不同的支持向量机核(高斯核、多项式核和线性核)进行数据训练和识别,找出高棉语的最佳核。该方法允许我们通过训练不同的字符训练片段来使用小的训练数据集,而不是训练大量的聚类。分类使用二进制数据0作为字符的空白区域,1作为字符的黑色像素区域;每个训练字符块被拉伸成各种图像大小的二进制数据矩阵。从矩阵中提取特征提取用于支持向量机分类。识别后,通过使用字符级别或常见错误纠错来组合每个聚类或字符。对750个高棉纯词或3000个左右字符的实验表明,高斯核支持向量机方法取得了较好的结果,在所有核中都有较好的性能。系统使用字号为32pt的训练数据中的一种字体“Khmer OS Content”来识别3种不同的字体大小。字号28pt的准确率为98.17%,字号32pt的准确率为98.62%,字号36pt的准确率为98.54%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信