Post-OCR Text Correction for Bulgarian Historical Documents
Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov
arXiv - CS - Digital Libraries, 2024-08-31, arXiv:2409.00527

The digitization of historical documents is crucial for preserving
society's cultural heritage. An important step in this process is
converting scanned images to text using Optical Character Recognition (OCR),
which can enable further search, information extraction, etc. Unfortunately,
this is a hard problem, as standard OCR tools are not tailored to deal with
historical orthography or with challenging layouts. Thus, it is
standard to apply an additional text correction step on the OCR output when
dealing with such documents. In this work, we focus on Bulgarian and create
the first benchmark dataset for evaluating OCR text correction of
historical Bulgarian documents written in the first standardized Bulgarian
orthography: the Drinov orthography from the 19th century. We further develop a
method for automatically generating synthetic data in this orthography, as well
as in the subsequent Ivanchev orthography, by leveraging vast amounts of
contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and
an encoder-decoder framework, which we augment with a diagonal attention loss
and copy and coverage mechanisms, to improve post-OCR text correction. The
proposed method reduces the errors introduced during recognition and improves
the quality of the documents by 25%, which is an increase of 16% compared to
the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data
and code at https://github.com/angelbeshirov/post-ocr-text-correction.
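
The abstract does not state how the 25% quality improvement is computed. As a minimal sketch, assuming it is a relative reduction in character-level edit distance to the ground truth (a common choice for post-OCR correction, but an assumption here), the calculation could look like the following. The function names `levenshtein` and `relative_improvement` are hypothetical and are not taken from the released code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), rolling DP row."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def relative_improvement(ocr: str, corrected: str, truth: str) -> float:
    """Percentage reduction in edit distance to the ground truth (assumed metric)."""
    before = levenshtein(ocr, truth)
    after = levenshtein(corrected, truth)
    if before == 0:
        # Nothing to fix: any change made by the corrector can only hurt.
        return 0.0 if after == 0 else float("-inf")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy strings, not drawn from the benchmark: two OCR errors, one fixed.
    truth = "historical"
    ocr = "hist0rica1"
    corrected = "historica1"
    print(relative_improvement(ocr, corrected, truth))
```

Under this definition a negative value means the correction step introduced more errors than it removed, so the metric penalizes over-correction as well as missed errors.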