{"title":"OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment","authors":"Thomas Hegghammer","doi":"10.31235/osf.io/6zfvs","DOIUrl":"https://doi.org/10.31235/osf.io/6zfvs","url":null,"abstract":"Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"57 1","pages":"861-882"},"PeriodicalIF":3.2,"publicationDate":"2021-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78057631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigation strategies against cascading failures within a project activity network","authors":"C. Ellinas, C. Nicolaides, Naoki Masuda","doi":"10.1007/s42001-021-00123-x","DOIUrl":"https://doi.org/10.1007/s42001-021-00123-x","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"10 1","pages":"383 - 400"},"PeriodicalIF":3.2,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85951292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analysis of US domestic migration via subset-stable measures of administrative data","authors":"B. Klemens","doi":"10.1007/S42001-021-00124-W","DOIUrl":"https://doi.org/10.1007/S42001-021-00124-W","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"24 1","pages":"351-382"},"PeriodicalIF":3.2,"publicationDate":"2021-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84722138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Battle of positioning: exploring the role of bridges in competitive diffusion","authors":"Jie Gu, Yunjie Xu","doi":"10.1007/s42001-021-00127-7","DOIUrl":"https://doi.org/10.1007/s42001-021-00127-7","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"26 1","pages":"319 - 350"},"PeriodicalIF":3.2,"publicationDate":"2021-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84383241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Users roles identification on online crowdsourced Q&A platforms and encyclopedias: a survey","authors":"A. Saxena, Harita Reddy","doi":"10.1007/s42001-021-00125-9","DOIUrl":"https://doi.org/10.1007/s42001-021-00125-9","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"208 1","pages":"285 - 317"},"PeriodicalIF":3.2,"publicationDate":"2021-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80548578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An open software environment to make spatial access metrics more accessible","authors":"J. Saxon, Julia Koschinsky, Karina Acosta, V. Anguiano, L. Anselin, Sergio J. Rey","doi":"10.1007/s42001-021-00126-8","DOIUrl":"https://doi.org/10.1007/s42001-021-00126-8","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"95 1 1","pages":"265 - 284"},"PeriodicalIF":3.2,"publicationDate":"2021-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83350158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Twitter networks using graph embeddings: an application to the British case","authors":"Miguel Won, Jorge M. Fernandes","doi":"10.1007/s42001-021-00128-6","DOIUrl":"https://doi.org/10.1007/s42001-021-00128-6","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"16 1","pages":"253 - 263"},"PeriodicalIF":3.2,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78522235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Colorado in context: Congressional redistricting and competing fairness criteria in Colorado","authors":"J. Clelland, H. Colgate, Daryl R. DeFord, Beth Malmskog, Flavia Sancier-Barbosa","doi":"10.1007/s42001-021-00119-7","DOIUrl":"https://doi.org/10.1007/s42001-021-00119-7","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"14 1","pages":"189 - 226"},"PeriodicalIF":3.2,"publicationDate":"2021-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73643669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A network view on reliability: using machine learning to understand how we assess news websites","authors":"Tobias Blanke, T. Venturini","doi":"10.1007/s42001-021-00116-w","DOIUrl":"https://doi.org/10.1007/s42001-021-00116-w","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"22 1","pages":"69 - 88"},"PeriodicalIF":3.2,"publicationDate":"2021-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76715377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the disparity of influence between users in the discussion of Brexit on Twitter","authors":"Amirarsalan Rajabi, Alexander V. Mantzaris, K. S. Atwal, I. Garibay","doi":"10.1007/s42001-021-00112-0","DOIUrl":"https://doi.org/10.1007/s42001-021-00112-0","url":null,"abstract":"","PeriodicalId":29946,"journal":{"name":"Journal of Computational Social Science","volume":"112 1","pages":"903 - 917"},"PeriodicalIF":3.2,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76373172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}