Splitting Merged Characters of Kannada Benchmark Dataset using Simplified Paired-Valleys and L-Cut
H. Kumar, A. Madhavaraj, A. Ramakrishnan
2019 National Conference on Communications (NCC), pp. 1-6, 2019-02-01
DOI: 10.1109/NCC.2019.8732239 (https://doi.org/10.1109/NCC.2019.8732239)
Citations: 4
Abstract
We reduce the computational complexity of the paired-valley algorithm for splitting merged characters from $\Theta(N^2)$ to $\Theta(N)$, where $N$ is the number of merged symbols. We also propose the L-cut algorithm, an effective way to separate merged half-consonants (known in Kannada as ottus) from their base symbols. We have created a benchmark dataset of 4033 Kannada sub-word images, each comprising two or more merged characters, and we measure the recognition accuracy of Tesseract OCR on this dataset before and after applying our technique. The accuracy of Tesseract v3 rises from 61.6% to 81.7%, an absolute gain of about 20 percentage points, after the characters are split by our method. Limited experiments on Telugu and Tamil explore how well the algorithm extends to other scripts.
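The abstract does not spell out the paired-valley or L-cut procedures, so the following is only a rough illustration of the general idea behind valley-based splitting, not the authors' method. It assumes a binary image stored as a list of rows (1 = ink), and cuts at low-ink columns of the vertical projection profile. The function names and the `max_ink` threshold are hypothetical, chosen for this sketch.

```python
from typing import List

Image = List[List[int]]  # binary image as rows of 0/1 pixels (assumed format)


def column_profile(image: Image) -> List[int]:
    """Count ink pixels (value 1) in each column."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def find_valley_cuts(profile: List[int], max_ink: int = 1) -> List[int]:
    """Return one cut column per interior run of low-ink columns.

    Generic projection-profile heuristic (an assumption of this sketch,
    not the paired-valley algorithm). It makes a single left-to-right
    pass, so its cost is linear in the image width. Runs that touch the
    left or right border are treated as margins and ignored.
    """
    cuts: List[int] = []
    run_start = None
    for x, ink in enumerate(profile):
        if ink <= max_ink:
            if run_start is None:
                run_start = x
        else:
            if run_start is not None and run_start > 0:
                # Cut in the middle of the low-ink run [run_start, x - 1].
                cuts.append((run_start + x - 1) // 2)
            run_start = None
    return cuts


def split_at(image: Image, cuts: List[int]) -> List[Image]:
    """Slice the image into pieces at the given cut columns."""
    if not image:
        return []
    bounds = [0] + list(cuts) + [len(image[0])]
    return [[row[a:b] for row in image] for a, b in zip(bounds, bounds[1:])]
```

For two blobs joined by a one-pixel-thick bridge, the profile dips along the bridge and a single cut is placed at its midpoint. A real merged Kannada sub-word would need more than this, for example handling ottus that sit below the base symbol, which is the case the L-cut is proposed for.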