OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Impact factor: 2.0. JCR quartile: Q2. Category: Social Sciences, Mathematical Methods

Journal of Computational Social Science. Pub Date: 2021-06-24. DOI: 10.31235/osf.io/6zfvs

Thomas Hegghammer

Citations: 26

Abstract

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
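
The corpus and request counts in the abstract can be reproduced with a short calculation. The sketch below is a plausibility check, not the author's code, and it rests on two assumptions that the abstract does not state: each scan is kept once in its original form alongside its 43 noisy replications (44 versions per scan), and English documents were sent to all three engines while Arabic documents were sent to only two. The variable and function names are illustrative.

```python
# Counts taken directly from the abstract.
ENGLISH_SCANS = 322
ARABIC_SCANS = 100
NOISY_REPLICATIONS = 43

# Assumption (not stated in the abstract): the original, noise-free
# version of each scan is kept in addition to the 43 noisy replications.
VERSIONS_PER_SCAN = NOISY_REPLICATIONS + 1

# Assumption (not stated in the abstract): English pages go to all three
# engines, Arabic pages to only two of them.
ENGINES_FOR_ENGLISH = 3
ENGINES_FOR_ARABIC = 2


def corpus_size(scans: int, versions: int = VERSIONS_PER_SCAN) -> int:
    """Number of document images produced from a set of source scans."""
    return scans * versions


english_docs = corpus_size(ENGLISH_SCANS)
arabic_docs = corpus_size(ARABIC_SCANS)
total_docs = english_docs + arabic_docs

total_requests = (
    english_docs * ENGINES_FOR_ENGLISH + arabic_docs * ENGINES_FOR_ARABIC
)

print(f"English documents: {english_docs}")
print(f"Arabic documents:  {arabic_docs}")
print(f"Total documents:   {total_docs}")
print(f"Process requests:  {total_requests}")
```

Under these assumptions the totals come to 18,568 documents and 51,304 requests, matching the figures reported in the abstract.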

Source journal

Journal of Computational Social Science (Social Sciences, Mathematical Methods)

CiteScore

6.20

Self-citation rate

6.20%

Number of articles published