Splitting Merged Characters of Kannada Benchmark Dataset using Simplified Paired-Valleys and L-Cut

2019 National Conference on Communications (NCC). Pub Date: 2019-02-01. DOI: 10.1109/NCC.2019.8732239

H. Kumar, A. Madhavaraj, A. Ramakrishnan

Citations: 4

Abstract

We reduce the computational complexity of the paired-valley algorithm for splitting merged characters from $\Theta(N^2)$ to $\Theta(N)$, where $N$ is the number of merged symbols. We also propose the L-cut algorithm, an effective way to separate merged half-consonants (known in Kannada as ottus) from the base symbols. We have created a benchmark dataset of 4033 Kannada sub-word images, each comprising two or more merged characters. We test the recognition accuracy of Tesseract OCR on this dataset before and after applying our technique. The accuracy of Tesseract v3 on the dataset increases from 61.6% to 81.7%, a gain of about 20 percentage points, after the characters are split by our method. The algorithm's scalability to other scripts has been explored through limited experiments on Telugu and Tamil.
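The abstract does not spell out the steps of the paired-valley or L-cut algorithms, so the following is only a generic illustration of the underlying idea of valley-based cut candidates, not the authors' method. It assumes a binarized sub-word image given as a list of rows (1 = ink) and finds local minima of the vertical projection profile in a single pass over the columns. The function names `vertical_projection` and `valley_columns` are hypothetical.

```python
def vertical_projection(image):
    """Count ink pixels in each column of a binary image (list of rows, 1 = ink)."""
    return [sum(col) for col in zip(*image)]


def valley_columns(profile):
    """Return indices of local minima (valleys) of the profile in one linear pass.

    A flat run of equal minimal values is reported by its middle column.
    This is an assumed, simplified notion of a 'valley', not the paper's definition.
    """
    valleys = []
    n = len(profile)
    i = 1
    while i < n - 1:
        if profile[i] < profile[i - 1]:
            # Extend over a plateau of equal values starting at column i.
            j = i
            while j + 1 < n and profile[j + 1] == profile[i]:
                j += 1
            # It is a valley only if the profile rises again after the plateau.
            if j + 1 < n and profile[j + 1] > profile[i]:
                valleys.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return valleys


# Two blobs joined by a thin horizontal bridge in the middle row.
merged = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 1, 1, 1],
]
profile = vertical_projection(merged)
cuts = valley_columns(profile)
```

In this toy input the only candidate cut falls on the thin bridge between the two blobs. A real splitter would still need to choose among candidates and handle ottus placed below the base symbol, which is the role the abstract assigns to the paired-valley and L-cut steps.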

