Splitting Merged Characters of Kannada Benchmark Dataset using Simplified Paired-Valleys and L-Cut

H. Kumar, A. Madhavaraj, A. Ramakrishnan
Published in: 2019 National Conference on Communications (NCC), pp. 1-6, February 2019. DOI: 10.1109/NCC.2019.8732239 (https://doi.org/10.1109/NCC.2019.8732239)
Citation count: 4

Abstract

We reduce the computational complexity of the paired-valley algorithm for splitting merged characters from $\Theta(N^2)$ to $\Theta(N)$, where $N$ is the number of merged symbols. We also propose the L-cut algorithm, an effective way to separate merged half-consonants (known in Kannada as ottus) from their base symbols. We have created a benchmark dataset of 4033 Kannada sub-word images, each comprising two or more merged characters, and we measure the recognition accuracy of Tesseract OCR on this dataset before and after applying our technique. After our method splits the characters, the accuracy of Tesseract v3 rises from 61.6% to 81.7%, a gain of about 20 percentage points. Limited experiments on Telugu and Tamil explore how well the algorithm extends to other scripts.
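
The abstract does not describe the paired-valley procedure itself. As a rough illustration only, the sketch below assumes that candidate cut points are sought at local minima ("valleys") of a vertical projection profile, a common heuristic in character segmentation. It is not the authors' simplified paired-valleys or L-cut algorithm, and the function names `vertical_projection` and `find_valleys` are hypothetical.

```python
from typing import List


def vertical_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each column of a binary image."""
    if not image:
        return []
    return [sum(column) for column in zip(*image)]


def find_valleys(profile: List[int]) -> List[int]:
    """Return column indices that are local minima of the profile.

    A flat run of equal values is reported once, at its centre.
    Border columns are skipped, since a cut there splits nothing.
    This is a single left-to-right scan over the profile.
    """
    valleys = []
    n = len(profile)
    i = 1
    while i < n - 1:
        # Extend j across a plateau of equal values starting at i.
        j = i
        while j + 1 < n - 1 and profile[j + 1] == profile[i]:
            j += 1
        left_higher = profile[i - 1] > profile[i]
        right_higher = profile[j + 1] > profile[j]
        if left_higher and right_higher:
            valleys.append((i + j) // 2)
        i = j + 1
    return valleys


# Toy image (an assumption for illustration): two blobs joined by a thin bridge.
toy = [
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
]
print(find_valleys(vertical_projection(toy)))  # prints [3]
```

In this toy case the only valley is the bridge column, which is where a merged pair would plausibly be cut. A real splitter would still need to decide which valleys to pair and accept, which is the part the paper addresses.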