Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez, Tony Montes, Rubén Manrique
arXiv - CS - Digital Libraries, 2024-07-04
DOI: arxiv-2407.12838 (https://doi.org/arxiv-2407.12838)
Abstract
This paper presents two significant contributions: first, a novel dataset of
19th-century Latin American press texts, which addresses the lack of
specialized corpora for historical and linguistic analysis in this region.
Second, it introduces a framework for OCR error correction and linguistic
surface form detection in digitized corpora, using a Large Language Model.
This framework is adaptable to various contexts and, in this paper, is
specifically applied to the newly created dataset.