OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Impact factor: 2.0 | JCR quartile: Q2 | Category: Social Sciences, Mathematical Methods
Thomas Hegghammer
DOI: 10.31235/osf.io/6zfvs (https://doi.org/10.31235/osf.io/6zfvs)
Journal: Journal of Computational Social Science, volume 57 1, pages 861-882
Publication date: 2021-06-24
Publication type: Journal Article
Citations: 26

Abstract

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
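
The corpus and request counts quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is a sanity check only, not the author's code. It rests on two assumptions that the abstract does not state: that each scan is kept in its original form alongside its 43 noisy copies (44 versions per scan), and that English documents were sent to all three engines while Arabic documents were sent to only two. The second assumption is adopted here because it reproduces the reported total.

```python
# Figures quoted in the abstract.
ENGLISH_SCANS = 322
ARABIC_SCANS = 100
NOISY_REPLICATIONS = 43
REPORTED_CORPUS = 18_568
REPORTED_REQUESTS = 51_304

# Assumption: the original scan is kept next to its 43 noisy copies.
versions_per_scan = 1 + NOISY_REPLICATIONS

# Total documents across both languages.
corpus_size = (ENGLISH_SCANS + ARABIC_SCANS) * versions_per_scan

# Assumption: three engines handled English, two handled Arabic.
ENGINES_FOR_ENGLISH = 3
ENGINES_FOR_ARABIC = 2

# One process request per document per engine.
process_requests = versions_per_scan * (
    ENGLISH_SCANS * ENGINES_FOR_ENGLISH + ARABIC_SCANS * ENGINES_FOR_ARABIC
)

print(f"corpus size:      {corpus_size:,} (reported {REPORTED_CORPUS:,})")
print(f"process requests: {process_requests:,} (reported {REPORTED_REQUESTS:,})")
```

Under these assumptions both computed values match the reported figures, which suggests the 18,568 documents include the unaltered originals.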
Source journal: Journal of Computational Social Science (Social Sciences, Mathematical Methods)
CiteScore: 6.20
Self-citation rate: 6.20%
Articles published: 30