Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez, Tony Montes, Rubén Manrique
arXiv - CS - Digital Libraries, 2024-07-04. DOI: arxiv-2407.12838
This paper presents two significant contributions: first, a novel dataset of
19th-century Latin American press texts, which addresses the lack of
specialized corpora for historical and linguistic analysis in this region.
Second, it introduces a framework for OCR error correction and linguistic
surface form detection in digitized corpora, utilizing a Large Language Model.
This framework is adaptable to various contexts and, in this paper, is
specifically applied to the newly created dataset.