CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models
Author: Jonathan Bourne
Venue: arXiv - CS - Digital Libraries
Publication date: 2024-08-30
DOI: arxiv-2408.17428 (https://doi.org/arxiv-2408.17428)
Citations: 0
Abstract
The digitisation of historical print media archives is crucial for increasing
accessibility to contemporary records. However, the process of Optical
Character Recognition (OCR) used to convert physical records to digital text is
prone to errors, particularly in the case of newspapers and periodicals due to
their complex layouts. This paper introduces Context Leveraging OCR Correction
(CLOCR-C), which utilises the infilling and context-adaptive abilities of
transformer-based language models (LMs) to improve OCR quality. The study aims
to determine whether LMs can perform post-OCR correction, whether the corrected
text improves downstream NLP tasks, and what value providing socio-cultural
context adds to the correction process. Experiments were conducted using seven LMs on three
datasets: the 19th Century Serials Edition (NCSE) and two datasets from the
Overproof collection. The results demonstrate that some LMs can significantly
reduce error rates, with the top-performing model achieving over a 60%
reduction in character error rate on the NCSE dataset. The OCR improvements
extend to downstream tasks, such as Named Entity Recognition, with increased
Cosine Named Entity Similarity. Furthermore, the study shows that providing
socio-cultural context in the prompts improves performance, while misleading
prompts lower performance. Alongside these findings, the study releases a
dataset of 91 transcribed articles from the NCSE, totalling 40,000 words,
to support further research in this area. The findings suggest
that CLOCR-C is a promising approach for enhancing the quality of existing
digital archives by leveraging the socio-cultural information embedded in the
LMs and the text requiring correction.
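
The character error rate (CER) figures quoted above can be made concrete with a small sketch. The Python below is illustrative only and is not the paper's evaluation code. It assumes CER is the Levenshtein edit distance divided by the reference length, and it treats the named-entity comparison as a plain cosine similarity between entity-count vectors, which is one plausible reading of Cosine Named Entity Similarity rather than the paper's exact definition. All function names (`levenshtein`, `cer`, `relative_reduction`, `entity_cosine`) are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Iterable


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error rate removed by correction."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return (cer_before - cer_after) / cer_before


def entity_cosine(ents_a: Iterable[str], ents_b: Iterable[str]) -> float:
    """Cosine similarity between two bags of named-entity strings."""
    a, b = Counter(ents_a), Counter(ents_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up strings for illustration; not taken from the NCSE or Overproof data.
    reference = "the quick brown fox"
    raw_ocr = "tbe qu1ck brovvn fox"
    corrected = "the quick brovn fox"
    before, after = cer(reference, raw_ocr), cer(reference, corrected)
    print(f"CER before: {before:.3f}, after: {after:.3f}")
    print(f"relative reduction: {relative_reduction(before, after):.0%}")
```

Under these assumptions, "over a 60% reduction in character error rate" presumably corresponds to `relative_reduction(...)` returning a value above 0.6 when the raw OCR and the LM-corrected text are each scored against the same human transcription.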