Improving Bangla OCR output through correction algorithms

2016 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA). DOI: 10.1109/SKIMA.2016.7916243

Md Sajib Ahmed, Teresa Gonçalves, H. Sarwar


Citations: 5

Abstract

Optical character recognition (OCR) for Bangla is long-needed software for the Bengali community worldwide. Numerous efforts suggest that, because of the inherent complexity of the Bangla alphabet and its word-formation process, developing a high-fidelity OCR system that produces reasonably acceptable output remains a challenge. One possible route to improvement is post-processing the OCR output: algorithms such as edit distance and statistical information from n-grams have been used to rectify misspelled words in language processing. This work presents the first known approach that uses these algorithms to replace misrecognized words produced by Bangla OCR. The assessment is made on a set of fifty documents written in Bangla script and uses a dictionary of 541,167 words. The proposed correction model corrects several words, lowering the recognition error rate by 2.87% and 3.18% for the character-based n-gram and edit distance algorithms, respectively. The developed system suggests a list of five alternatives for a misspelled word. For the n-gram algorithm, the correct word is the topmost of the five suggestions in 33.82% of cases; for the edit distance algorithm, the first suggestion is the correct word in 36.31% of cases. This work is intended to stimulate ideas for further improvements in character recognition.
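The two correction strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the boundary padding, the use of the Dice coefficient over character bigrams, and the three-word toy dictionary are all assumptions made for illustration. The code also compares Unicode code points, whereas a real Bangla system might need to work on grapheme clusters.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def char_ngrams(word: str, n: int = 2) -> Counter:
    """Multiset of character n-grams, with assumed boundary markers ^ and $."""
    padded = f"^{word}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-gram multisets (an assumed scoring)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    overlap = sum((ga & gb).values())
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0


def suggest_by_edit_distance(word, dictionary, k=5):
    """Return the k dictionary words closest to `word` by edit distance."""
    return sorted(dictionary, key=lambda w: (edit_distance(word, w), w))[:k]


def suggest_by_ngrams(word, dictionary, k=5):
    """Return the k dictionary words most similar to `word` by n-gram overlap."""
    return sorted(dictionary, key=lambda w: (-ngram_similarity(word, w), w))[:k]


# Hypothetical toy dictionary; the paper uses a dictionary of 541,167 words.
toy_dictionary = ["বাংলা", "বাংলো", "বালা"]
misrecognized = "বাংনা"  # assumed OCR error: one consonant misread

print(suggest_by_edit_distance(misrecognized, toy_dictionary))
print(suggest_by_ngrams(misrecognized, toy_dictionary))
```

In this toy case both rankings place the intended word first. With a full-size dictionary, a linear scan like this would be slow, so a practical system would likely prefilter candidates (for example by length or by shared n-grams) before scoring them.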
