Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts
Tobias Englmeier, F. Fink, U. Springmann, K. Schulz
Journal for Language Technology and Computational Linguistics, 2022-12-03. DOI: 10.21248/jlcl.35.2022.232
Systems for the post-correction of OCR results for historical texts are based on statistical correction models obtained by supervised learning, which requires suitable collections of ground truth material for training. In this paper we investigate how the power of automated OCR post-correction depends on the form of the ground truth data and on the other training settings used to compute a post-correction model. The post-correction system A-PoCoTo considered here is based on a profiler service that computes a statistical profile for the OCR-ed input text. We also look in detail at the influence of the profiler resources and of the other settings selected for training and evaluation. As the practical outcome of several fine-tuning steps, we obtain a general post-correction model; experiments on a large and heterogeneous collection of OCR-ed historical texts show a consistent improvement over the base OCR accuracy. The results are meant to provide guidance for libraries that want to apply OCR post-correction to a broad spectrum of distinct OCR-ed historical printings and need "representative" results.
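As an aside for readers who want to reproduce the kind of accuracy comparison summarized above, the following is a minimal, hypothetical Python sketch (not the A-PoCoTo implementation; standard library only) of how raw and post-corrected OCR output can be scored against aligned ground truth lines via character error rate. All function names and line data are illustrative.

# Illustrative sketch: quantify an "improvement of base OCR accuracy" by
# comparing the character error rate (CER) of raw OCR output and of
# post-corrected output against aligned ground truth lines.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: list[str], reference: list[str]) -> float:
    """Character error rate over aligned line pairs."""
    errors = sum(levenshtein(h, r) for h, r in zip(hypothesis, reference))
    chars = sum(len(r) for r in reference)
    return errors / chars if chars else 0.0

# Hypothetical aligned lines: raw OCR, post-corrected output, and ground truth.
ocr_lines = ["Tbe quick brovvn fox", "iumps ouer the lazy dog"]
corrected_lines = ["The quick brown fox", "jumps ouer the lazy dog"]
gt_lines = ["The quick brown fox", "jumps over the lazy dog"]

print(f"base OCR CER:       {cer(ocr_lines, gt_lines):.3f}")
print(f"post-corrected CER: {cer(corrected_lines, gt_lines):.3f}")

Run as a plain script, this prints the base and post-corrected CER for the toy lines; in the setting of the paper the same kind of comparison is made over large, heterogeneous collections of OCR-ed historical printings rather than hand-picked examples.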