Persian OCR with Cascaded Convolutional Neural Networks Supported by Language Model

2020 10th International Conference on Computer and Knowledge Engineering (ICCKE). Pub Date: 2020-10-29. DOI: 10.1109/ICCKE50421.2020.9303691

M. PourReza, R. Derakhshan, S. Bibak, M. Fallah, H. Fayyazi, M. Sabokrou



Abstract

Persian OCR is a difficult task because of specific features of Persian script, such as letters taking different forms depending on their position in a word and the visual similarity of many letters. Recognizing sub-words instead of individual letters can reduce these difficulties. With this approach, sub-word segmentation becomes the critical task of the pre-processing step. In this paper, a cascaded Convolutional Neural Network is used to convert sub-word images into text. A large dictionary of Persian sub-word images in different font styles is used as training data, and an Auto-Encoder enriches the features needed for constructing the cascade classifier structure. The initial classifier learns the overall structure of sub-word images; its training data is the result of applying k-means clustering to the large sub-word image dataset. The later classifier finds the exact text equivalent of the sub-word image. A word segmentation method forms words from the extracted sub-words, using contour distances as a measure for distinguishing words from sub-words. The initial OCR result is improved using Natural Language Processing techniques. Two fast search structures over word dictionaries, together with a language model, form the post-processing module and substitute misspelled extracted words with the best alternative. Comparison with the Tesseract OCR engine shows the superiority of the algorithm.
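The word-formation step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that contour distances separate words from sub-words, so the code below assumes each sub-word is reduced to a horizontal extent `(x_left, x_right)` on one text line, and that a fixed, hypothetical `gap_threshold` in pixels decides whether two neighbours belong to the same word. The function name `group_subwords` is likewise illustrative.

```python
from typing import List, Tuple

# Assumption: a sub-word is represented only by its horizontal extent
# (x_left, x_right) in pixels on a single text line.
Box = Tuple[int, int]


def group_subwords(boxes: List[Box], gap_threshold: int) -> List[List[Box]]:
    """Group sub-word boxes into words using the gap between neighbours.

    Persian is written right to left, so boxes are visited from the
    rightmost to the leftmost. A gap smaller than gap_threshold keeps the
    next sub-word in the current word; a larger gap starts a new word.
    """
    # Sort by right edge, descending, to follow reading order.
    ordered = sorted(boxes, key=lambda b: b[1], reverse=True)
    words: List[List[Box]] = []
    for box in ordered:
        if words:
            prev = words[-1][-1]
            # Horizontal space between the previous sub-word's left edge
            # and the current sub-word's right edge.
            gap = prev[0] - box[1]
            if gap < gap_threshold:
                words[-1].append(box)
                continue
        words.append([box])
    return words


if __name__ == "__main__":
    line = [(20, 37), (100, 120), (40, 70), (80, 97)]
    for word in group_subwords(line, gap_threshold=5):
        print(word)
```

In practice the threshold would likely need to adapt to font size and style; a fixed value is used here only to keep the example small.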


