Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: Latest Publications

Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache
Christian Reul, S. Göttel, U. Springmann, C. Wick, Kay-Michael Würzner, F. Puppe
DOI: 10.1145/3322905.3322910 | Published: 2019-05-08
Abstract: When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For that purpose, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information onto the OCR-recognized text output. As a test case, we used a German dictionary (Sander's Wörterbuch der Deutschen Sprache) from the 19th century, which exhibits a particularly complex semantic use of typography. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our approach works with real historical data and can deal with frequent typography changes even within lines.
Citations: 3
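The word-level tagging described above implies a step that lifts per-glyph typography predictions to whole words. A minimal sketch of that mapping, assuming a majority vote over glyph labels (the function name, input format and voting rule are illustrative assumptions, not the paper's actual implementation):

```python
from collections import Counter

def word_typography(chars):
    """Assign each OCR-recognized word the majority typography label of
    its characters. `chars` is a list of (character, label) pairs; a
    space character ends the current word."""
    words, current = [], []
    for ch, label in chars:
        if ch == " ":
            if current:
                words.append(_finish(current))
                current = []
        else:
            current.append((ch, label))
    if current:
        words.append(_finish(current))
    return words

def _finish(current):
    text = "".join(ch for ch, _ in current)
    # Majority vote over per-glyph labels; ties break toward the
    # label seen first within the word.
    label = Counter(label for _, label in current).most_common(1)[0][0]
    return (text, label)
```

For example, a word whose glyphs were classified bold, bold, italic would come out tagged bold, smoothing over isolated per-glyph misclassifications.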
Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project
Bruno Bon, Krzysztof Nowak, Laura Vangone
DOI: 10.1145/3322905.3322925 | Published: 2019-05-08
Abstract: This paper presents the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin), which by 2022 is intended to compile the largest representative corpus of Medieval Latin texts. The corpus, which is to comprise 150 million tokens, is expected to provide selected texts from four centuries of Latin written production (from 800 to 1200 AD) from all across Europe. It will also cover a wide gamut of genres, from theological texts to historiography, documents and letters. In the first stage of the project, which started in mid-2018, we are selecting the texts to be included in the corpus, based on the metadata in the electronic database of Medieval Latin texts that is, at the moment, the largest scholarly-driven source of information of this kind freely available on the Internet. Once selected, the texts are retrieved from existing collections and digital libraries. As early tests showed, less than half of the texts already exist in interoperable formats such as TEI XML, or at least in a form that allows for easy conversion without human intervention. This means that the bulk of the corpus texts has to be acquired from digital images of editions available online through OCR and post-processing. For both tasks there now exists a broad range of efficient tools, and many sophisticated workflows have been proposed in the literature. However, the presented project is significantly limited in its resources, since a single person is expected to control the process and improve OCR quality within a single year.
In the presentation we would like, first, to demonstrate the workflow of the project, which at the moment consists of 1) image extraction from PDF files, 2) image cleaning, 3) OCR, 4) batch correction of OCR errors, and 5) removal of non-Latin text with a simple classifier. The tools we use are all free and open source, an important factor in a project which is low on resources but ambitious in its goals. PDF extraction and conversion are performed with the Linux 'convert' and 'pdfimages' commands. The output TIFFs are cleaned with ScanTailor, while the OCR is carried out with Tesseract. To save time, the entire workflow is automated, with the human analyst verifying the quality of the output and mass-correcting OCR errors with the Post Correction Tool. Apart from presenting the project and the workflow, the paper discusses the challenges we have faced. One of the most problematic issues turned out to be the disparate quality of the image files retrieved from online sources. Another factor that significantly hinders automatic processing is the quality of the text editions.
Citations: 0
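The extraction-cleaning-OCR stages of the workflow above can be sketched as command construction in Python. The sketch only builds the per-stage command lines rather than running them; the binary names and flags are illustrative assumptions (ScanTailor's command-line interface in particular varies by fork), not the project's actual configuration:

```python
from pathlib import Path

def pipeline_commands(pdf_path, workdir, lang="lat"):
    """Build the shell commands for one document: image extraction
    from the PDF, batch cleaning, and OCR. Commands are returned
    rather than executed so the sketch stays side-effect free."""
    pdf = Path(pdf_path)
    out = Path(workdir)
    return [
        # 1) extract embedded page images from the PDF as TIFFs
        ["pdfimages", "-tiff", str(pdf), str(out / pdf.stem)],
        # 2) deskew and clean the extracted images in batch
        # (hypothetical CLI invocation; ScanTailor is usually a GUI tool)
        ["scantailor-cli", str(out), str(out / "clean")],
        # 3) OCR one cleaned page with a Latin model
        ["tesseract", str(out / "clean" / "page.tif"),
         str(out / "page"), "-l", lang],
    ]
```

In a real run each list would be passed to `subprocess.run`, with the human analyst inspecting the Tesseract output afterwards.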
Using lexicography to characterise relations between species mentions in the biodiversity literature
Sandra Young
DOI: 10.1145/3322905.3322918 | Published: 2019-05-08
Abstract: The biodiversity literature is one of the longest-standing examples of recorded heritage in the world. Today there are many efforts to standardise and integrate the literature to ensure access to the information, both for heritage and research purposes. Ontologies are increasingly being turned to as knowledge representation tools in these efforts. However, the validity of using ontological frameworks to represent biological taxonomies has been questioned. Biological taxonomies use the scientific nomenclature to assign names to described species. While the nomenclature is a useful classification tool, it can also be a source of confusion because of its synonymous, homonymous and fluid nature. Despite this, no empirical evaluation of scientific nomenclature use in the literature has ever been performed. Corpus-based analysis is already used in automatic ontology extraction, and this study explores the possibility of applying recently developed lexicography techniques to the problem, to provide an evaluation of the empirical data in the literature and to serve as a comparison with existing ontologies. This paper focuses on the workflow, parameters and preliminary findings of the research, investigating how to extract structures from the literature to perform these comparisons. It does so by combining corpus-analysis techniques with visualisation and filtering methods, and it evaluates the potential classification and disambiguation qualities of the resulting graphs for future work. Preliminary results look at the effects of frequency and salience when filtering the graphs, and indicate that these filter parameters could be used for different purposes in revealing relationships between organism mentions.
Citations: 0
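The frequency and salience filtering mentioned above can be illustrated on a plain co-occurrence edge list. A minimal sketch, where salience is approximated as an edge's share of all observed co-occurrences (a stand-in for whatever measure the study actually uses; the species data is invented for illustration):

```python
def filter_edges(edges, min_count=0, min_salience=0.0):
    """Drop co-occurrence edges below a raw-frequency threshold or
    below a salience threshold. `edges` maps (mention_a, mention_b)
    pairs to raw co-occurrence counts."""
    total = sum(edges.values()) or 1
    return {pair: n for pair, n in edges.items()
            if n >= min_count and n / total >= min_salience}

# Hypothetical co-occurrence counts between organism mentions.
edges = {
    ("Puma concolor", "cougar"): 12,
    ("Puma concolor", "panther"): 2,
    ("Felis catus", "cat"): 30,
}
strong = filter_edges(edges, min_count=10)
```

Raising `min_count` keeps only well-attested mention pairs, while `min_salience` re-weights them relative to the whole graph; the paper's observation is that the two cut the graph in different, complementary ways.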
OCR for Greek polytonic (multi accent) historical printed documents: development, optimization and quality control
Anna-Maria Sichani, Panagiotis Kaddas, Georgios K. Mikros, B. Gatos
DOI: 10.1145/3322905.3322926 | Published: 2019-05-08
Abstract: This paper presents the development and implementation of a robust OCR tool and a related comprehensive workflow for the recognition of Greek printed polytonic scripts. This project is initiated and developed by an interdisciplinary team with expertise in the areas of document image processing, character segmentation and recognition, machine learning, corpus creation and digital humanities. Our paper aims to describe the design and development of the workflow around this project, including data gathering and structuring, OCR tool development, user interface development, experiments on the training procedure of the tool, evaluation, post-correction and quality control of the results.
Citations: 1
Hidden Metadata in Plain Sights: Romanian Folklore Catalogues
Liviu Pop
DOI: 10.1145/3322905.3322912 | Published: 2019-05-08
Abstract: This paper explores the way in which the old catalogues from the Romanian folklore archives can be improved with updated information about two key aspects of the folklore collections: the informant/performer and the village/location where the recordings were made. Using the headers from three different archives, we show the solution found for enriching the equivalents of dc:creator and dc:coverage. We particularly focus on a case study of a series of recordings from Western Transylvania and how the metadata was cleaned, spliced and improved.
Citations: 0
From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and useable
Arnoud Gorter, Rutger van Koert, I. Tames, Edwin Klijn, M. Scherer
DOI: 10.1145/3322905.3322906 | Published: 2019-05-08
Abstract: The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), the NIOD Institute for War, Holocaust and Genocide Studies, the Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). TRIADO explores technological strategies to transform analogue text-based archival collections into digital data that can be used for research. The first part of the project tries out new techniques to open up collections; the second part is a 'reality check' exploring the research potential of the data created. Increasingly, archives, libraries and museums (ALMs) digitize their analogue historical collections. Yet in 2017 it was estimated that only approximately one tenth of all heritage collections in Europe had been digitized so far. There is still a large gap between the specific needs of the digital humanities community and the digital 'raw materials' supplied by the ALMs. Text-based historical collections are potentially interesting to a wide range of scientific disciplines, but so far, in the case of the Netherlands, only a few digitized archives are equipped to be used for digital research. The main aim of TRIADO is to bridge this gap by performing a 'laboratory to reality' check with the most frequently consulted WWII archive in the Netherlands: the Central Archive of Special Jurisdiction (CABR). The CABR, held by the Nationaal Archief, consists of the legal case files of some 300,000 persons accused of collaborating with the German occupier.
The CABR contains approximately 4 kilometers of analogue documents (shelf space), ranging from minutes and verdicts to membership cards, forms and summonses. Most documents are typed or hybrid (typed/handwritten). The experimental pilot project TRIADO focuses on two complementary research questions: 1. Which digital methods are best suited (in terms of quality, efficiency, etc.) to make large corpora of unstructured, imperfect data, based on analogue collections, usable as a research facility? 2. Is it possible to answer specific, mainly quantitative statistical research questions on the basis of the digital data created under 1? A sample of 13.8 meters from the CABR was digitized to test technologies and perform experiments. A workflow for mass digitization was also devised, and a demonstrator was built to showcase the results of the experiments. In this paper we discuss the main findings of the research done in part 1: processes for mass digitization, OCR quality and improvement, auto-classification of document types, named entity recognition, date extraction, and matching of existing name lists to OCR'd data.
Citations: 1
Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts
Arnau Baró, Jialuo Chen, A. Fornés, Beáta Megyesi
DOI: 10.1145/3322905.3322920 | Published: 2019-05-08
Abstract: Historical ciphers, a special type of manuscript, contain encrypted information important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image-processing techniques. Despite the improvements in handwritten text recognition (HTR) thanks to deep learning methodologies, the need for labelled training data is an important limitation. Given that ciphers often use symbol sets drawn from various alphabets, as well as unique symbols without any transcription scheme available, these supervised HTR techniques are not suitable for transcribing ciphers. In this paper we propose an unsupervised method for transcribing encrypted manuscripts based on clustering and label propagation, which has been successfully applied to community detection in networks. We analyze the performance on ciphers with various symbol sets, and discuss the advantages and drawbacks compared to supervised HTR methods.
Citations: 13
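The label propagation the paper borrows from community detection can be sketched on a plain similarity graph. A minimal deterministic version, where nodes stand in for glyph clusters and edges for visual similarity (the real system works on image features; this toy graph and the tie-breaking rule are assumptions made for reproducibility):

```python
from collections import Counter

def label_propagation(adjacency, max_iter=20):
    """Propagate labels over an undirected graph: each node repeatedly
    adopts the most common label among its neighbours until nothing
    changes. Nodes are visited in sorted order and ties break toward
    the smallest label, so the outcome is deterministic."""
    labels = {node: node for node in adjacency}  # seed each node with its own id
    for _ in range(max_iter):
        changed = False
        for node in sorted(adjacency):
            neighbour_labels = [labels[m] for m in adjacency[node]]
            if not neighbour_labels:
                continue
            counts = Counter(neighbour_labels)
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Two disconnected "symbol" clusters each collapse to a single label,
# i.e. all occurrences of the same cipher symbol get one transcription.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1],
         3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(graph)
```

Each resulting label class would then be mapped to one transcription symbol, which is the unsupervised step that replaces labelled training data.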
Diamonds in Borneo: Commodities as Concepts in Context
K. Hofmeester, A. Ashkpour, K. Depuydt, J. Does
DOI: 10.1145/3322905.3322924 | Published: 2019-05-08
Abstract: The intensified circulation of people, commodities and ideas is one of the characteristics of a globalizing world. To understand the causes and consequences of these circulations, we have to know which commodities circulated when and where, on what scale, and who made them circulate. In our paper we present the first results of a CLARIAH Research Pilot on diamonds in Borneo, using the large historical newspaper collection of the KB (Royal Library of the Netherlands) in Delpher. So far, the diamond industry in Borneo has been a true blind spot in our knowledge of the global diamond commodity chain. We have little information on where diamonds were found, who the miners and traders were, and whether there was really an 'age-old' diamond-polishing industry as the literature suggests. We believe that the newspapers can provide more information on this topic. To answer these questions, we developed a workflow that enables us to query the KB newspaper collection in an efficient and elaborate way and that can also be used for research on other commodities.
Citations: 2
Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii
J. Opitz, Leo Born, Vivi Nastase, Yannick Pultar
DOI: 10.1145/3322905.3322921 | Published: 2019-05-08
Abstract: Historic itinerary research investigates the traveling paths of historic entities to determine their influence and reach. A potential source of such information is the Regesta Imperii (RI), a large-scale resource for European medieval history research. However, two intermediate problems must be addressed: 1. place names may be stated as unknown or left empty; 2. place-name queries return large candidate sets of points scattered all across Europe, from which the correct point must be selected. For 1., we perform a place-name completion step to predict place names for regests referencing charters of unknown origin. To address 2., we formulate a graph framework which allows efficient reconstruction of the emperors' itineraries by means of shortest-path algorithms. Our experiments show that our method predicts coordinates of places with significant correlation to human gold coordinates and significantly outperforms a baseline which selects points randomly from the candidate sets. We further show that the method can be leveraged to detect errors in human coordinate labels of place names.
Citations: 1
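The candidate-selection problem (2. above) amounts to a shortest path through layers of candidate coordinates, one layer per dated regest. A sketch of that idea as a Viterbi-style dynamic program, with Euclidean distance standing in for the geographic distance the paper would use (layer data and function name are illustrative assumptions):

```python
import math

def pick_itinerary(layers):
    """layers[i] is the list of candidate (x, y) points for stop i.
    Return one point per stop such that the summed leg distances along
    the itinerary are minimal: a shortest path through a layered graph."""
    cost = [0.0] * len(layers[0])   # best total distance ending at each candidate
    back = []                       # back[i][j]: best predecessor of layers[i+1][j]
    for prev, cur in zip(layers, layers[1:]):
        new_cost, new_back = [], []
        for p in cur:
            legs = [cost[j] + math.dist(prev[j], p) for j in range(len(prev))]
            j = min(range(len(prev)), key=legs.__getitem__)
            new_cost.append(legs[j])
            new_back.append(j)
        cost, back = new_cost, back + [new_back]
    # trace back from the cheapest endpoint
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [layers[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(layers[i][j])
    return path[::-1]

# Middle stop has two homonymous candidates; the nearby one is chosen.
stops = [[(0, 0)], [(1, 0), (5, 5)], [(2, 0)]]
```

This captures why a far-flung homonym gets rejected: it would inflate the total travel distance of the reconstructed itinerary.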
Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic
Emad Mohamed, Z. Sayyed
DOI: 10.1145/3322905.3322927 | Published: 2019-05-08
Abstract: While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and its orthography, most effort has focused on Modern Standard Arabic. In this paper we focus on pre-MSA texts. We use the gradient boosting algorithm to train a morphological segmenter on a corpus derived from Al-Manar, a late 19th/early 20th-century magazine that focused on the Arabic and Islamic heritage. Since most of the available Arabic cultural-heritage material suffers from substandard orthography, we have also trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization achieves an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
Citations: 1
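Segmenters of this kind typically cast the task as per-character boundary classification over a window of surrounding characters. A sketch of that framing, with a trivial stub rule in place of the paper's trained gradient-boosting model, on a romanized toy word (the transliteration, window width and rule are all illustrative assumptions):

```python
def char_windows(word, k=2):
    """Yield a fixed-width character window around each position; these
    windows are the features a boundary classifier (gradient boosting
    in the paper) would be trained on. '#' pads the word edges."""
    padded = "#" * k + word + "#" * k
    return [padded[i:i + 2 * k + 1] for i in range(len(word))]

def segment(word, is_boundary):
    """Split `word` before every position the classifier flags."""
    pieces, start = [], 0
    for i, window in enumerate(char_windows(word)):
        if i > 0 and is_boundary(window):
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

# Stub "classifier": split off a word-initial "w" (the Arabic
# conjunction wa- in a toy transliteration) - illustration only.
rule = lambda window: window[:2] == "#w"
```

Stemming then falls out as a by-product: once prefixes and suffixes are split off as separate pieces, the remaining piece is the stem.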