Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

arXiv - CS - Digital Libraries. Pub Date: 2024-07-04. DOI: arxiv-2407.12838

Laura Manrique-Gómez, Tony Montes, Rubén Manrique


Citations: 0

Abstract


Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.
