CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

arXiv - CS - Digital Libraries. Pub Date: 2024-08-30. DOI: arxiv-2408.17428

Jonathan Bourne


Citations: 0

Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the Optical Character Recognition (OCR) process used to convert physical records to digital text is prone to errors, particularly for newspapers and periodicals, because of their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving a reduction of over 60% in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks such as Named Entity Recognition, where Cosine Named Entity Similarity increases. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower it. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40,000 words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and in the text requiring correction.
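The headline result is stated as a relative reduction in character error rate (CER). As a minimal sketch of how such a figure is commonly computed, the Python below takes CER to be the Levenshtein edit distance divided by the reference length, which is the usual definition; the abstract does not spell out the paper's exact evaluation procedure, so treat this as an assumption. The function names and the sample strings are hypothetical and chosen only for illustration; they are not drawn from the NCSE or Overproof data.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_error_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error removed by correction."""
    if cer_before == 0:
        raise ValueError("no errors to reduce")
    return (cer_before - cer_after) / cer_before


# Made-up example strings, not taken from the paper's datasets.
reference = "The Queen visited London"
raw_ocr = "Tbe Qneen visitcd Lond0n"     # simulated OCR output
corrected = "The Queen visited Lond0n"   # simulated LM correction

cer_raw = character_error_rate(reference, raw_ocr)
cer_corrected = character_error_rate(reference, corrected)
reduction = relative_error_reduction(cer_raw, cer_corrected)
print(f"CER before: {cer_raw:.3f}, after: {cer_corrected:.3f}, reduction: {reduction:.0%}")
```

Under this definition, a reduction of over 60% means the corrected text has less than 40% of the character errors of the raw OCR output, measured against the same reference transcription.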
