Context-Dependent Confusions Rules for Building Error Model Using Weighted Finite State Transducers for OCR Post-Processing

M. A. Azawi, T. Breuel
Published in: 2014 11th IAPR International Workshop on Document Analysis Systems
Publication date: 2014-04-07
DOI: 10.1109/DAS.2014.75 (https://doi.org/10.1109/DAS.2014.75)
Citations: 7

Abstract

In this paper, we propose a new technique for correcting OCR errors by means of weighted finite state transducers (WFSTs) with context-dependent confusion rules. We translate the OCR confusions that appear in the recognition output into edit operations (insertions, deletions, and substitutions) using the Levenshtein edit distance algorithm. The edit operations are extracted in the form of rules that take the context of the incorrect string into account, and these rules are used to build an error model with weighted finite state transducers. The context-dependent rules help ensure that each rule is applied only to the appropriate strings. Our new error model avoids the calculations that occur in searching the language model, and it also enables the language model to correct incorrect words by using context-dependent confusion rules. Our approach is language independent, is designed to handle varying numbers of errors, and places no limit on word size. In a set of experiments conducted on OCRed pages from the UWIII dataset, our proposed error model outperforms both the baseline and the existing state-of-the-art single-character rule-based approach: its error rate on the UWIII test set is 0.68%, compared with 1.14% for the baseline and 1.0% for the single-character rule-based approach.
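The rule-extraction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the rule format, the context width, or the weighting scheme, so the one-character context window (`width=1`), the tuple layout `(left, wrong, right, correct)`, and the negative-log relative-frequency weights are all assumptions made here for illustration. The sketch aligns an OCR string with its ground truth by Levenshtein distance, records each non-matching edit together with its surrounding OCR context, and turns rule counts into costs of the kind a WFST arc could carry.

```python
import math
from collections import Counter

EPS = ""  # empty string stands for "no character" (insertion or deletion)


def align(ocr, truth):
    """Levenshtein alignment; returns a list of (ocr_char, truth_char) pairs."""
    n, m = len(ocr), len(truth)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # extra OCR character
                             dist[i][j - 1] + 1)         # missing OCR character
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                dist[i][j] == dist[i - 1][j - 1] + (ocr[i - 1] != truth[j - 1]):
            pairs.append((ocr[i - 1], truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ocr[i - 1], EPS))
            i -= 1
        else:
            pairs.append((EPS, truth[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def extract_rules(ocr, truth, width=1):
    """Return (left, wrong, right, correct) rules; context is taken from the OCR string.

    The context width is an assumed parameter, not a value from the paper.
    """
    rules, pos = [], 0  # pos tracks the current offset in the OCR string
    for o, t in align(ocr, truth):
        if o != t:
            left = ocr[max(0, pos - width):pos]
            right = ocr[pos + len(o):pos + len(o) + width]
            rules.append((left, o, right, t))
        pos += len(o)
    return rules


def rule_weights(pairs, width=1):
    """Map each rule to -log(relative frequency), an assumed cost-style weight."""
    counts = Counter()
    for ocr, truth in pairs:
        counts.update(extract_rules(ocr, truth, width))
    total = sum(counts.values())
    return {rule: -math.log(c / total) for rule, c in counts.items()}


if __name__ == "__main__":
    training = [("tbe", "the"), ("tbe", "the"), ("hel lo", "hello")]
    for rule, weight in rule_weights(training).items():
        print(rule, round(weight, 3))
```

In this sketch, frequent confusions receive lower costs, so a shortest-path search over the composed transducers would prefer them. Compiling the rules into an actual WFST and composing it with a language model is left out, since the abstract gives no detail on that construction.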