Amharic Character Recognition Based on Features Extracted by CNN and Auto-Encoder Models

Efrem Yohannes Obsie, Hongchun Qu, Qingqing Huang
Published in: Proceedings of the 13th International Conference on Computer Modeling and Simulation
Publication date: 2021-06-25
DOI: 10.1145/3474963.3474972 (https://doi.org/10.1145/3474963.3474972)
Citations: 4

Abstract

Amharic is an ancient Semitic language that serves as the official language of the Federal Democratic Republic of Ethiopia. Because a large number of historical and literary documents are written in Amharic, an automated OCR system is in high demand. Previous approaches, however, relied on traditional machine learning algorithms with hand-crafted feature extraction, and their performance degrades sharply in the presence of the script's large set of structurally similar characters. Various studies on Amharic character recognition suggest that more robust feature extraction techniques can address this problem. In this study, we propose a hybrid method that uses two deep learning models, a Convolutional Neural Network (CNN) and a Convolutional Auto-Encoder (CAE), for feature extraction; Random Forest (RF) and Mutual Information (MI) feature selection methods for selecting the top features; and a traditional machine learning algorithm, the Support Vector Machine (SVM), for classification. First, the features extracted by the two deep models are combined to form hybrid features, and the top features are then selected by applying each feature selection method. The features selected by both methods (their intersection) are then passed to the SVM for recognition. In the experiments, CNN-extracted features achieved an accuracy of 96.03% and CAE-extracted features achieved 92.52%, whereas the proposed method, based on the intersection of the features selected by RF and MI, achieved 97.06%.
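
The selection and classification stages described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the synthetic matrix from `make_classification` merely stands in for CNN and CAE features, and the values of `k`, the forest size, the RBF kernel, and the helper names `top_k_indices` and `select_common_features` are all assumptions, since the abstract does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def top_k_indices(scores, k):
    """Return the set of indices of the k largest scores."""
    return set(np.argsort(scores)[-k:].tolist())


def select_common_features(X, y, k, seed=0):
    """Indices ranked in the top k by BOTH RF importance and mutual information."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    mi = mutual_info_classif(X, y, random_state=seed)
    common = top_k_indices(rf.feature_importances_, k) & top_k_indices(mi, k)
    return np.array(sorted(common))


# Assumption: synthetic data replaces the real deep features. The first 20
# columns play the role of CNN features, the last 20 the role of CAE features.
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_redundant=0, n_classes=4, random_state=0)
cnn_feats, cae_feats = X[:, :20], X[:, 20:]

# Step 1: concatenate the two feature sets into hybrid features.
hybrid = np.hstack([cnn_feats, cae_feats])

X_train, X_test, y_train, y_test = train_test_split(
    hybrid, y, test_size=0.25, random_state=0, stratify=y)

# Step 2: keep only features that both selectors rank in their top k.
# With k=25 out of 40 features, the intersection has at least 10 members.
selected = select_common_features(X_train, y_train, k=25)

# Step 3: classify with an SVM on the common features only.
clf = SVC(kernel="rbf").fit(X_train[:, selected], y_train)
accuracy = clf.score(X_test[:, selected], y_test)
print(f"{len(selected)} common features, test accuracy {accuracy:.3f}")
```

Feature selection is fitted on the training split only, so the test accuracy is not influenced by labels from the held-out data. The accuracy printed here reflects the synthetic data and says nothing about the figures reported in the paper.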