Scaling Up Whole-Book Recognition

Pingping Xiu, H. Baird
DOI: 10.1109/ICDAR.2009.22 (https://doi.org/10.1109/ICDAR.2009.22)
Published in: 2009 10th International Conference on Document Analysis and Recognition
Publication date: 2009-07-26
Citations: 13

Abstract

We describe the results of large-scale experiments with algorithms for unsupervised improvement of recognition of book-images using fully automatic mutual-entropy-based model adaptation. Each experiment is initialized with an imperfect iconic model derived from errorful OCR results, and a more or less perfect linguistic model, after which our fully automatic adaptation algorithm corrects the iconic model to achieve improved accuracy, guided only by evidence within the test set. Mutual-entropy scores measure disagreements between the two models and identify candidates for iconic model correction. Previously published experiments have shown that word error rates fall monotonically with passage length. Here we show similar results for character error rates extending over far longer passages up to fifty pages in length: we observed error rates driven from 25% down to 1.9%. We present new experimental results to support the motivating principle of our strategy: that error rates and mutual-entropy scores are strongly correlated. Also, we discuss theoretical, algorithmic, and methodological challenges that we have encountered as we scale up experiments towards complete books.
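The core signal described above — a mutual-entropy score that measures disagreement between the iconic (image-side) model and the linguistic model, flagging candidates for correction — can be sketched in miniature. This is an illustrative assumption, not the paper's exact formulation: the cross-entropy-style `disagreement` function, the toy two-class posteriors, and the site names are hypothetical, chosen only to show how high-disagreement sites would rank first as correction candidates.

```python
import math

def disagreement(iconic_post, linguistic_post, eps=1e-12):
    """Cross-entropy-style disagreement between two posterior
    distributions over the same character classes.
    (Hypothetical stand-in for the paper's mutual-entropy score.)"""
    return -sum(p * math.log(q + eps)
                for p, q in zip(iconic_post, linguistic_post))

# Toy recognition sites: (iconic posterior, linguistic posterior).
sites = {
    "site_A": ([0.9, 0.1], [0.85, 0.15]),  # models largely agree
    "site_B": ([0.8, 0.2], [0.10, 0.90]),  # models sharply disagree
}

# Rank sites by disagreement: the highest-scoring site is the
# prime candidate for iconic-model correction.
ranked = sorted(sites, key=lambda s: disagreement(*sites[s]), reverse=True)
print(ranked[0])  # prints "site_B"
```

Under this sketch, adaptation would revisit the top-ranked sites and adjust the iconic model where the linguistic model's evidence overwhelms it — consistent with the abstract's claim that error rates and mutual-entropy scores are strongly correlated.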