Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project

Bruno Bon, Krzysztof Nowak, Laura Vangone
{"title":"Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project","authors":"Bruno Bon, Krzysztof Nowak, Laura Vangone","doi":"10.1145/3322905.3322925","DOIUrl":null,"url":null,"abstract":"This paper aims to present the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin) which, by 2022, is intended to compile the largest representative corpus of Medieval Latin texts. The corpus, which is to comprise 150 millions tokens, is expected to provide selected texts from four centuries of Latin written production (from 800 to 1200 AD) from all across Europe. It will also cover a wide gamut of genres from theological texts to historiography, to documents and letters. In the first stage of the project, that started in the mid-2018, we are selecting the texts to be included in the corpus, basing on the metadata in the electronic database of Medieval Latin texts that is, at the moment, the largest scholarly-driven source of information of this kind available free on the Internet. Once selected, the texts are retrieved from existing collections and digital libraries. As early tests showed, less than a half of the texts already exist in interoperable formats such as TEI XML, or at least in a form that allows for easy conversion which does not require human intervention. This means that the bulk of the corpus texts has to be acquired from digital images of editions available on-line through OCR and post-processing. For both tasks, there now exists a broad range of efficient tools, and many sophisticated workflows were proposed in literature. However, the presented project is significantly limited when it comes to its resources, since one person is expected to work on controlling the process and improving OCR quality during a single year. In the presentation we would like, first, to demonstrate the workflow of the project which, at the moment, consists of the 1) image extraction from PDF files, 2) image cleaning, and its subsequent 3) OCR, followed by 4) the batch-correction of the OCR errors, and 5) the removal of the non-Latin text with a simple classifier. The tools we use are all free and open source, an important factor in a project which is low on resources but ambitious in its goals. The PDF extraction and conversion are performed with Linux 'convert' and 'pdfimages' commands. The output TIFFs are cleaned with the \"ScanTailor\", while the OCR is realised with \"Tesseract\". To save on time, the entire workflow is automated, with the human analyst verifying the quality of the output and mass-correcting OCR errors with the \"Post Correction Tool\". Apart from presenting the project and the workflow, the paper will discuss the challenges we have faced. One of the most problematic issues turned out to be the relatively disparate quality of the image files retrieved from online sources. 
Another factor that significantly hinders the automatic processing was the quality of text editions.","PeriodicalId":418911,"journal":{"name":"Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3322905.3322925","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This paper presents the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin), which by 2022 is intended to compile the largest representative corpus of Medieval Latin texts. The corpus, which is to comprise 150 million tokens, is expected to provide selected texts from four centuries of Latin written production (from 800 to 1200 AD) from all across Europe. It will also cover a wide gamut of genres, from theological texts to historiography, documents and letters.

In the first stage of the project, which started in mid-2018, we are selecting the texts to be included in the corpus, based on the metadata in the electronic database of Medieval Latin texts that is, at the moment, the largest scholarly-driven source of information of this kind freely available on the Internet. Once selected, the texts are retrieved from existing collections and digital libraries. As early tests showed, fewer than half of the texts already exist in interoperable formats such as TEI XML, or at least in a form that allows for easy conversion without human intervention. This means that the bulk of the corpus texts has to be acquired from digital images of editions available online, through OCR and post-processing. For both tasks there now exists a broad range of efficient tools, and many sophisticated workflows have been proposed in the literature. However, the project is significantly limited in its resources, since a single person is expected to control the process and improve OCR quality within a single year.

In the presentation we would like, first, to demonstrate the workflow of the project, which at the moment consists of 1) image extraction from PDF files, 2) image cleaning, 3) OCR, 4) batch correction of OCR errors, and 5) removal of non-Latin text with a simple classifier. The tools we use are all free and open source, an important factor in a project which is low on resources but ambitious in its goals. PDF extraction and conversion are performed with the Linux command-line tools 'convert' and 'pdfimages'. The output TIFF files are cleaned with ScanTailor, while the OCR is realised with Tesseract. To save time, the entire workflow is automated, with a human analyst verifying the quality of the output and mass-correcting OCR errors with the Post Correction Tool.

Apart from presenting the project and the workflow, the paper discusses the challenges we have faced. One of the most problematic issues turned out to be the highly variable quality of the image files retrieved from online sources. Another factor that significantly hinders automatic processing is the quality of the text editions themselves.
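As a rough illustration of how steps 1, 3 and 5 of such a workflow can be chained together, the sketch below shells out to pdfimages and Tesseract and then filters the OCR output with a naive stopword heuristic. It is a minimal sketch, not the project's actual implementation: the 'lat' Tesseract model, the stopword list, the 0.15 threshold and the file-naming conventions are assumptions made for the example, and the interactive stages (ScanTailor cleaning, batch correction with the Post Correction Tool) are deliberately left out.

```python
#!/usr/bin/env python3
"""Minimal sketch of the extraction -> OCR -> Latin-filtering steps.

Illustration only: the actual VELUM workflow also includes image
cleaning with ScanTailor and batch error correction with the Post
Correction Tool, which are not reproduced here.
"""
import subprocess
import sys
from pathlib import Path

# A small set of high-frequency Latin function words used as a crude
# language signal; the project's own "simple classifier" is not
# described in detail in the abstract, so this list is an assumption.
LATIN_STOPWORDS = {
    "et", "in", "ad", "non", "cum", "est", "ut", "de", "per", "quod",
    "qui", "quae", "ex", "sed", "si", "ab", "atque", "sunt", "esse",
}


def extract_images(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Extract the embedded page scans from a PDF with Poppler's pdfimages.

    The -tiff flag requires a reasonably recent Poppler build; older
    versions fall back to PPM/PBM output.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / pdf_path.stem
    subprocess.run(["pdfimages", "-tiff", str(pdf_path), str(prefix)], check=True)
    return sorted(out_dir.glob(f"{pdf_path.stem}-*.tif"))


def ocr_image(tiff_path: Path) -> str:
    """Run Tesseract on one image, assuming the Latin ('lat') traineddata
    is installed; return the recognised plain text."""
    out_base = tiff_path.with_suffix("")          # tesseract appends .txt
    subprocess.run(
        ["tesseract", str(tiff_path), str(out_base), "-l", "lat"],
        check=True,
    )
    return out_base.with_suffix(".txt").read_text(encoding="utf-8")


def looks_latin(line: str, threshold: float = 0.15) -> bool:
    """Crude heuristic: keep a line if enough of its tokens are common
    Latin function words. The threshold would need tuning on real output."""
    tokens = [t.lower().strip(".,;:()[]") for t in line.split()]
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in LATIN_STOPWORDS)
    return hits / len(tokens) >= threshold


def process_pdf(pdf_path: Path, out_dir: Path) -> str:
    """Extract, OCR and filter one PDF; return the retained Latin text."""
    kept_lines = []
    for tiff in extract_images(pdf_path, out_dir):
        for line in ocr_image(tiff).splitlines():
            if looks_latin(line):
                kept_lines.append(line)
    return "\n".join(kept_lines)


if __name__ == "__main__":
    print(process_pdf(Path(sys.argv[1]), Path("work")))
```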