{"title":"Aufbau eines Referenzkorpus zur deutschsprachigen internetbasierten Kommunikation als Zusatzkomponente für die Korpora im Projekt 'Digitales Wörterbuch der deutschen Sprache' (DWDS)","authors":"Michael Beißwenger, L. Lemnitzer","doi":"10.21248/jlcl.28.2013.174","DOIUrl":"https://doi.org/10.21248/jlcl.28.2013.174","url":null,"abstract":"Dieser Beitrag gibt einen Überblick über die laufenden Arbeiten im Projekt „Deutsches Referenzkorpus zur internetbasierten Kommunikation“ (DeRiK), in dem ein Korpus zur Sprachverwendung in der deutschsprachigen internetbasierten Kommunikation aufgebaut wird. Das Korpus ist als eine Zusatzkomponente zu den Korpora im BBAW-Projekt „Digitales Wörterbuch der deutschen Sprache“ (DWDS, http://www.dwds.de) konzipiert, die die geschriebene deutsche Sprache seit 1900 dokumentieren.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132862029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supervised OCR Error Detection and Correction Using Statistical and Neural Machine Translation Methods","authors":"Chantal Amrhein, S. Clematide","doi":"10.5167/UZH-162394","DOIUrl":"https://doi.org/10.5167/UZH-162394","url":null,"abstract":"For indexing the content of digitized historical texts, optical character recognition (OCR) errors are a hampering problem. To explore the effectivity of new strategies for OCR post-correction, this article focuses on methods of character-based machine translation, specifically neural machine translation and statistical machine translation. Using the ICDAR 2017 data set on OCR post-correction for English and French, we experiment with different strategies for error detection and error correction. We analyze how OCR post-correction with NMT can profit from using additional information and show that SMT and NMT can benefit from each other for these tasks. An ensemble of our models reached best performance in ICDAR’s 2017 error correction subtask and performed competitively in error detection. However, our experimental results also suggest that tuning supervised learning for OCR post-correction of texts from different sources, text types (periodicals and monographs), time periods and languages is a difficult task: the data on which the MT systems are trained have a large influence on which methods and features work best. Conclusive and generally applicable insights are hard to achieve.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133791771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus","authors":"C. Wick, Christian Reul, F. Puppe","doi":"10.21248/jlcl.33.2018.219","DOIUrl":"https://doi.org/10.21248/jlcl.33.2018.219","url":null,"abstract":"This paper proposes a combination of a convolutional and an LSTM network to improve the accuracy of OCR on early printed books. While the default approach of line based OCR is to use a single LSTM layer as provided by the well-established OCR software OCRopus (OCRopy), we utilize a CNN-and Pooling-Layer combination in advance of an LSTM layer as implemented by the novel OCR software Calamari. Since historical prints often require book speci fi c models trained on manually labeled ground truth (GT) the goal is to maximize the recognition accuracy of a trained model while keeping the needed manual e ff ort to a minimum. We show, that the deep model signi fi cantly outperforms the shallow LSTM network when using both many and only a few training examples, although the deep network has a higher amount of trainable parameters. Hereby, the error rate is reduced by a factor of up to 55%, yielding character error rates (CER) of 1% and below for 1,000 lines of training. To further improve the results, we apply a con fi dence voting mechanism to achieve CERs below 0 . 5%. A simple data augmentation scheme and the usage of pretrained models reduces the CER further by up to 62% if only few training data is available. Thus, we require only 100 lines of GT to reach an average CER of 1.2%. The runtime of the deep model for training and prediction of a book behaves very similar to a shallow network when trained on a CPU. However, the usage of a GPU, as supported by Calamari, reduces the prediction time by a factor of at least four and the training time by more than six.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125781681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowdsourcing the OCR Ground Truth of a German and French Cultural Heritage Corpus","authors":"S. Clematide, Lenz Furrer, M. Volk","doi":"10.21248/jlcl.33.2018.217","DOIUrl":"https://doi.org/10.21248/jlcl.33.2018.217","url":null,"abstract":"Crowdsourcing approaches for post-correction of OCR output (Optical Character Recognition) have been successfully applied to several historical text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts mainly written in German and French, all typeset in Antiqua font. Finding and engaging volunteers for correcting large amounts of pages into high quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous efforts for keeping the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, achieving an OCR ground truth with a systematically evaluated accuracy of 99.7 % on the word level. The crowdsourced OCR ground truth and the corresponding original OCR recognition results from Abbyy FineReader for each page are available as a resource for machine learning and evaluation. Additionally, the scanned images (300 dpi) of all pages are included to enable tests with other OCR software.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"266 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123450435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin","authors":"U. Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter","doi":"10.21248/jlcl.33.2018.220","DOIUrl":"https://doi.org/10.21248/jlcl.33.2018.220","url":null,"abstract":"In this paper we describe a dataset of German and Latin textit{ground truth} (GT) for historical OCR in the form of printed text line images paired with their transcription. This dataset, called textit{GT4HistOCR}, consists of 313,173 line pairs covering a wide period of printing dates from incunabula from the 15th century to 19th century books printed in Fraktur types and is openly available under a CC-BY 4.0 license. The special form of GT as line image/transcription pairs makes it directly usable to train state-of-the-art recognition models for OCR software employing recurring neural networks in LSTM architecture such as Tesseract 4 or OCRopus. We also provide some pretrained OCRopus models for subcorpora of our dataset yielding between 95% (early printings) and 98% (19th century Fraktur printings) character accuracy rates on unseen test cases, a Perl script to harmonize GT produced by different transcription rules, and give hints on how to construct GT for OCR purposes which has requirements that may differ from linguistically motivated transcriptions.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"3 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131401563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning","authors":"Christian Reul, U. Springmann, C. Wick, F. Puppe","doi":"10.21248/jlcl.33.2018.216","DOIUrl":"https://doi.org/10.21248/jlcl.33.2018.216","url":null,"abstract":"We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome by also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement we select additional training lines which the voters disagree most on, expecting them to offer the highest information gain for a subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further to less than 1% on average.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"300 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133991345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey and Comparative Study of Arabic Diacritization Tools","authors":"O. Hamed, Torsten Zesch","doi":"10.21248/jlcl.32.2017.213","DOIUrl":"https://doi.org/10.21248/jlcl.32.2017.213","url":null,"abstract":"Modern Standard Arabic, as well as other languages based on the Arabic script, are usually written without diacritics, which complicates many language processing tasks. Although many different approaches for automatic diacritization of Arabic have been proposed, it is still unclear what performance level can be expected in a practical setting. For that purpose, we first survey the Arabic diacritization tools in the literature and group the results by the corpus used for testing. We then conduct a comparative study between the available tools for diacritization (Farasa and Madamira) as well as two baselines. We evaluate the error rates for these systems using a set of publicly available, fully-diacritized corpora in two different evaluation modes. With the help of human annotators, we conduct an additional experiment examining error categories. We find that Farasa is outperforming Madamira and the baselines in both modes.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125713371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers","authors":"A. Alosaimy, E. Atwell","doi":"10.21248/jlcl.32.2017.212","DOIUrl":"https://doi.org/10.21248/jlcl.32.2017.212","url":null,"abstract":"Focusing on Classical Arabic, this paper in its first part evaluates morphological analysers and POS taggers that are available freely for research purposes, are designed for Modern Standard Arabic (MSA) or Classical Arabic (CA), are able to analyse all forms of words, and have academic credibility. We list and compare supported features of each tool, and how they differ in the format of the output, segmentation, Part-of-Speech (POS) tags and morphological features. We demonstrate a sample output of each analyser against one CA fully-vowelized sentence. This evaluation serves as a guide in choosing the best tool that suits research needs. In the second part, we report the accuracy and coverage of tagging a set of classical Arabic vocabulary extracted from classical texts. The results show a drop in the accuracy and coverage and suggest an ensemble method might increase accuracy and coverage for classical Arabic.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115627390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relativisation across varieties: A corpus analysis of Arabic texts","authors":"Zainab Al-Zaghir","doi":"10.21248/jlcl.32.2017.214","DOIUrl":"https://doi.org/10.21248/jlcl.32.2017.214","url":null,"abstract":"Relative clauses are among the main structures that are used frequently in written texts and everyday conversations. Different studies have been conducted to investigate how relative clauses are used and distributed in corpora. Some studies support the claim that accessibility to relativisation, represented by the Noun Phrase Accessibility Hierarchy (NPAH) which is proposed by KEENAN and COMRIE (1977), predict the distribution of relative clauses in corpora. Other studies found out that discourse functions of relative clauses have an important role in distributing relative clauses in corpora (FOX, 1987). However, little focus has been given to the role of the variety in which relative clauses are written in the distribution of relative clauses in written texts. This study investigates relativisation in Arabic written texts in three varieties: Classical Arabic, Modern Standard Arabic and Iraqi Arabic. A statistical analysis of the results shows that relativisation patterns differ significantly across varieties of the Arabic language and cannot be predicted by one accessibility hierarchy.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131350782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Interactive Multidimensional Visualisations for Corpus Linguistics","authors":"Paul Rayson, J. Mariani, Bryce Anderson-Cooper, Alistair Baron, David Gullick, Andrew Moore, Stephen Wattam","doi":"10.21248/jlcl.31.2016.200","DOIUrl":"https://doi.org/10.21248/jlcl.31.2016.200","url":null,"abstract":"We propose the novel application of dynamic and interactive visualisation techniques to support the iterative and exploratory investigations typical of the corpus linguistics methodology. Very large scale text analysis is already carried out in corpus-based language analysis by employing methods such as frequency profiling, keywords, concordancing, collocations and n-grams. However, at present only basic visualisation methods are utilised. In this paper, we describe case studies of multiple types of key word clouds, explorer tools for collocation networks, and compare network and language distance visualisations for online social networks. These are shown to fit better with the iterative data-driven corpus methodology, and permit some level of scalability to cope with ever increasing corpus size and complexity. In addition, they will allow corpus linguistic methods to be used more widely in the digital humanities and social sciences since the learning curve with visualisations is shallower for non-experts","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133104870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}