Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: Latest Publications

Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache
Christian Reul, S. Göttel, U. Springmann, C. Wick, Kay-Michael Würzner, F. Puppe
DOI: 10.1145/3322905.3322910 | Published: 2019-05-08
Abstract: When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For that purpose, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information onto the OCR-recognized text output. As a test case, we used a German dictionary (Sander's Wörterbuch der Deutschen Sprache) from the 19th century, which exhibits a particularly complex semantic use of typography. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our approach works with real historical data and can deal with frequent typography changes even within lines.
Citations: 3
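The word-level tagging described above implies a step that lifts per-glyph typography predictions to whole words. A minimal sketch of that mapping, assuming a majority vote over glyph labels (the function name, input format and voting rule are illustrative assumptions, not the paper's actual implementation):

```python
from collections import Counter

def word_typography(chars):
    """Assign each OCR-recognized word the majority typography label of
    its characters. `chars` is a list of (character, label) pairs; a
    space character ends the current word."""
    words, current = [], []
    for ch, label in chars:
        if ch == " ":
            if current:
                words.append(_finish(current))
                current = []
        else:
            current.append((ch, label))
    if current:
        words.append(_finish(current))
    return words

def _finish(current):
    text = "".join(ch for ch, _ in current)
    # Majority vote over per-glyph labels; ties break toward the
    # label seen first within the word.
    label = Counter(label for _, label in current).most_common(1)[0][0]
    return (text, label)
```

For example, a word whose glyphs were classified bold, bold, italic would come out tagged bold, smoothing over isolated per-glyph misclassifications.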
Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project
Bruno Bon, Krzysztof Nowak, Laura Vangone
DOI: 10.1145/3322905.3322925 | Published: 2019-05-08
Abstract: This paper presents the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin), which by 2022 is intended to compile the largest representative corpus of Medieval Latin texts. The corpus, which is to comprise 150 million tokens, is expected to provide selected texts from four centuries of Latin written production (from 800 to 1200 AD) from all across Europe. It will also cover a wide gamut of genres, from theological texts to historiography, documents and letters. In the first stage of the project, which started in mid-2018, we are selecting the texts to be included in the corpus, based on the metadata in the electronic database of Medieval Latin texts that is, at the moment, the largest scholarly-driven source of information of this kind freely available on the Internet. Once selected, the texts are retrieved from existing collections and digital libraries. As early tests showed, less than half of the texts already exist in interoperable formats such as TEI XML, or at least in a form that allows for easy conversion without human intervention. This means that the bulk of the corpus texts has to be acquired from digital images of editions available online through OCR and post-processing. For both tasks there now exists a broad range of efficient tools, and many sophisticated workflows have been proposed in the literature. However, the presented project is significantly limited in its resources, since a single person is expected to control the process and improve OCR quality within a single year.
In the presentation we would like, first, to demonstrate the workflow of the project, which at the moment consists of 1) image extraction from PDF files, 2) image cleaning, 3) OCR, 4) batch correction of OCR errors, and 5) removal of non-Latin text with a simple classifier. The tools we use are all free and open source, an important factor in a project which is low on resources but ambitious in its goals. PDF extraction and conversion are performed with the Linux 'convert' and 'pdfimages' commands. The output TIFFs are cleaned with ScanTailor, while the OCR is carried out with Tesseract. To save time, the entire workflow is automated, with the human analyst verifying the quality of the output and mass-correcting OCR errors with the Post Correction Tool. Apart from presenting the project and the workflow, the paper discusses the challenges we have faced. One of the most problematic issues turned out to be the disparate quality of the image files retrieved from online sources. Another factor that significantly hinders automatic processing is the quality of the text editions.
Citations: 0
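The extraction-cleaning-OCR stages of the workflow above can be sketched as command construction in Python. The sketch only builds the per-stage command lines rather than running them; the binary names and flags are illustrative assumptions (ScanTailor's command-line interface in particular varies by fork), not the project's actual configuration:

```python
from pathlib import Path

def pipeline_commands(pdf_path, workdir, lang="lat"):
    """Build the shell commands for one document: image extraction
    from the PDF, batch cleaning, and OCR. Commands are returned
    rather than executed so the sketch stays side-effect free."""
    pdf = Path(pdf_path)
    out = Path(workdir)
    return [
        # 1) extract embedded page images from the PDF as TIFFs
        ["pdfimages", "-tiff", str(pdf), str(out / pdf.stem)],
        # 2) deskew and clean the extracted images in batch
        # (hypothetical CLI invocation; ScanTailor is usually a GUI tool)
        ["scantailor-cli", str(out), str(out / "clean")],
        # 3) OCR one cleaned page with a Latin model
        ["tesseract", str(out / "clean" / "page.tif"),
         str(out / "page"), "-l", lang],
    ]
```

In a real run each list would be passed to `subprocess.run`, with the human analyst inspecting the Tesseract output afterwards.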
Using lexicography to characterise relations between species mentions in the biodiversity literature
Sandra Young
DOI: 10.1145/3322905.3322918 | Published: 2019-05-08
Abstract: The biodiversity literature is one of the longest-standing examples of recorded heritage in the world. Today there are many efforts to standardise and integrate the literature to ensure access to the information, both for heritage and research purposes. Ontologies are increasingly being turned to as knowledge representation tools in these efforts. However, the validity of using ontological frameworks to represent biological taxonomies has been questioned. Biological taxonomies use the scientific nomenclature to assign names to described species. While the nomenclature is a useful classification tool, it can also be a source of confusion because of its synonymous, homonymous and fluid nature. Despite this, no empirical evaluation of scientific nomenclature use in the literature has ever been performed. Corpus-based analysis is already used in automatic ontology extraction, and this study explores the possibility of applying recently developed lexicography techniques to the problem, to provide an evaluation of the empirical data in the literature and to serve as a comparison with existing ontologies. This paper focuses on the workflow, parameters and preliminary findings of the research, investigating how to extract structures from the literature to perform these comparisons. It does so by combining corpus-analysis techniques with visualisation and filtering methods, and it evaluates the potential classification and disambiguation qualities of the resulting graphs for future work. Preliminary results look at the effects of frequency and salience when filtering the graphs, and indicate that these filter parameters could be used for different purposes in revealing relationships between organism mentions.
Citations: 0
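The frequency and salience filtering mentioned above can be illustrated on a plain co-occurrence edge list. A minimal sketch, where salience is approximated as an edge's share of all observed co-occurrences (a stand-in for whatever measure the study actually uses; the species data is invented for illustration):

```python
def filter_edges(edges, min_count=0, min_salience=0.0):
    """Drop co-occurrence edges below a raw-frequency threshold or
    below a salience threshold. `edges` maps (mention_a, mention_b)
    pairs to raw co-occurrence counts."""
    total = sum(edges.values()) or 1
    return {pair: n for pair, n in edges.items()
            if n >= min_count and n / total >= min_salience}

# Hypothetical co-occurrence counts between organism mentions.
edges = {
    ("Puma concolor", "cougar"): 12,
    ("Puma concolor", "panther"): 2,
    ("Felis catus", "cat"): 30,
}
strong = filter_edges(edges, min_count=10)
```

Raising `min_count` keeps only well-attested mention pairs, while `min_salience` re-weights them relative to the whole graph; the paper's observation is that the two cut the graph in different, complementary ways.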
OCR for Greek polytonic (multi accent) historical printed documents: development, optimization and quality control
Anna-Maria Sichani, Panagiotis Kaddas, Georgios K. Mikros, B. Gatos
DOI: 10.1145/3322905.3322926 | Published: 2019-05-08
Abstract: This paper presents the development and implementation of a robust OCR tool and a related comprehensive workflow for the recognition of Greek printed polytonic scripts. This project is initiated and developed by an interdisciplinary team with expertise in the areas of document image processing, character segmentation and recognition, machine learning, corpus creation and digital humanities. Our paper aims to describe the design and development of the workflow around this project, including data gathering and structuring, OCR tool development, user interface development, experiments on the training procedure of the tool, evaluation, post-correction and quality control of the results.
Citations: 1
Hidden Metadata in Plain Sights: Romanian Folklore Catalogues
Liviu Pop
DOI: 10.1145/3322905.3322912 | Published: 2019-05-08
Abstract: This paper explores the way in which the old catalogues from the Romanian folklore archives can be improved with updated information about two key aspects of the folklore collections: the informant/performer and the village/location where the recordings were made. Using the headers from three different archives, we show the solution found for enriching the equivalents of dc:creator and dc:coverage. We particularly focus on a case study of a series of recordings from Western Transylvania and how the metadata was cleaned, spliced and improved.
Citations: 0
From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and useable
Arnoud Gorter, Rutger van Koert, I. Tames, Edwin Klijn, M. Scherer
DOI: 10.1145/3322905.3322906 | Published: 2019-05-08
Abstract: The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), the NIOD Institute for War, Holocaust and Genocide Studies, the Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). TRIADO explores technological strategies to transform analogue text-based archival collections into digital data that can be used for research. The first part of the project tries out new techniques to open up collections; the second part is a 'reality check' exploring the research potential of the data created. Increasingly, archives, libraries and museums (ALMs) digitize their analogue historical collections. Yet in 2017 it was estimated that only approximately one tenth of all heritage collections in Europe had been digitized so far. There is still a large gap between the specific needs of the digital humanities community and the digital 'raw materials' supplied by the ALMs. Text-based historical collections are potentially interesting to a wide range of scientific disciplines, but so far, in the case of the Netherlands, only a few digitized archives are equipped to be used for digital research. The main aim of TRIADO is to bridge this gap by performing a 'laboratory to reality' check with the most frequently consulted WWII archive in the Netherlands: the Central Archive of Special Jurisdiction (CABR). The CABR, held by the Nationaal Archief, consists of the legal case files of some 300,000 persons accused of collaborating with the German occupier.
The CABR contains approximately 4 kilometers of analogue documents (shelf space), ranging from minutes and verdicts to membership cards, forms and summonses. Most documents are typed or hybrid (typed/handwritten). The experimental pilot project TRIADO focuses on two complementary research questions: 1. Which digital methods are best suited (in terms of quality, efficiency, etc.) to make large corpora of unstructured, imperfect data, based on analogue collections, usable as a research facility? 2. Is it possible to answer specific, mainly quantitative statistical research questions on the basis of the digital data created under 1? A sample of 13.8 meters from the CABR was digitized to test technologies and perform experiments. A workflow for mass digitization was also devised, and a demonstrator was built to showcase the results of the experiments. In this paper we discuss the main findings of the research done in part 1: processes for mass digitization, OCR quality and improvement, auto-classification of document types, named entity recognition, date extraction, and matching of existing name lists to OCR'd data.
Citations: 1
Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts
Arnau Baró, Jialuo Chen, A. Fornés, Beáta Megyesi
DOI: 10.1145/3322905.3322920 | Published: 2019-05-08
Abstract: Historical ciphers, a special type of manuscript, contain encrypted information important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image-processing techniques. Despite the improvements in handwritten text recognition (HTR) thanks to deep learning methodologies, the need for labelled training data is an important limitation. Given that ciphers often use symbol sets drawn from various alphabets, as well as unique symbols without any transcription scheme available, these supervised HTR techniques are not suitable for transcribing ciphers. In this paper we propose an unsupervised method for transcribing encrypted manuscripts based on clustering and label propagation, which has been successfully applied to community detection in networks. We analyze the performance on ciphers with various symbol sets, and discuss the advantages and drawbacks compared to supervised HTR methods.
Citations: 13
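The label propagation the paper borrows from community detection can be sketched on a plain similarity graph. A minimal deterministic version, where nodes stand in for glyph clusters and edges for visual similarity (the real system works on image features; this toy graph and the tie-breaking rule are assumptions made for reproducibility):

```python
from collections import Counter

def label_propagation(adjacency, max_iter=20):
    """Propagate labels over an undirected graph: each node repeatedly
    adopts the most common label among its neighbours until nothing
    changes. Nodes are visited in sorted order and ties break toward
    the smallest label, so the outcome is deterministic."""
    labels = {node: node for node in adjacency}  # seed each node with its own id
    for _ in range(max_iter):
        changed = False
        for node in sorted(adjacency):
            neighbour_labels = [labels[m] for m in adjacency[node]]
            if not neighbour_labels:
                continue
            counts = Counter(neighbour_labels)
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Two disconnected "symbol" clusters each collapse to a single label,
# i.e. all occurrences of the same cipher symbol get one transcription.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1],
         3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(graph)
```

Each resulting label class would then be mapped to one transcription symbol, which is the unsupervised step that replaces labelled training data.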
Diamonds in Borneo: Commodities as Concepts in Context
K. Hofmeester, A. Ashkpour, K. Depuydt, J. Does
DOI: 10.1145/3322905.3322924 | Published: 2019-05-08
Abstract: The intensified circulation of people, commodities and ideas is one of the characteristics of a globalizing world. To understand the causes and consequences of these circulations, we have to know which commodities circulated when and where, on what scale, and who made them circulate. In our paper we present the first results of a CLARIAH Research Pilot on diamonds in Borneo, using the large historical newspaper collection of the KB (Royal Library of the Netherlands) in Delpher. So far, the diamond industry in Borneo has been a true blind spot in our knowledge of the global diamond commodity chain. We have little information on where diamonds were found, who the miners and traders were, and whether there was really an 'age-old' diamond-polishing industry as the literature suggests. We believe that the newspapers can provide more information on this topic. To answer these questions, we developed a workflow that enables us to query the KB newspaper collection in an efficient and elaborate way and that can also be used for research on other commodities.
Citations: 2
Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii
J. Opitz, Leo Born, Vivi Nastase, Yannick Pultar
DOI: 10.1145/3322905.3322921 | Published: 2019-05-08
Abstract: Historic itinerary research investigates the traveling paths of historic entities to determine their influence and reach. A potential source of such information is the Regesta Imperii (RI), a large-scale resource for European medieval history research. However, two intermediate problems must be addressed: 1. place names may be stated as unknown or left empty; 2. place-name queries return large candidate sets of points scattered all across Europe, from which the correct point must be selected. For 1., we perform a place-name completion step to predict place names for regests referencing charters of unknown origin. To address 2., we formulate a graph framework which allows efficient reconstruction of the emperors' itineraries by means of shortest-path algorithms. Our experiments show that our method predicts coordinates of places with significant correlation to human gold coordinates and significantly outperforms a baseline which selects points randomly from the candidate sets. We further show that the method can be leveraged to detect errors in human coordinate labels of place names.
Citations: 1
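The candidate-selection problem (2. above) amounts to a shortest path through layers of candidate coordinates, one layer per dated regest. A sketch of that idea as a Viterbi-style dynamic program, with Euclidean distance standing in for the geographic distance the paper would use (layer data and function name are illustrative assumptions):

```python
import math

def pick_itinerary(layers):
    """layers[i] is the list of candidate (x, y) points for stop i.
    Return one point per stop such that the summed leg distances along
    the itinerary are minimal: a shortest path through a layered graph."""
    cost = [0.0] * len(layers[0])   # best total distance ending at each candidate
    back = []                       # back[i][j]: best predecessor of layers[i+1][j]
    for prev, cur in zip(layers, layers[1:]):
        new_cost, new_back = [], []
        for p in cur:
            legs = [cost[j] + math.dist(prev[j], p) for j in range(len(prev))]
            j = min(range(len(prev)), key=legs.__getitem__)
            new_cost.append(legs[j])
            new_back.append(j)
        cost, back = new_cost, back + [new_back]
    # trace back from the cheapest endpoint
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [layers[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(layers[i][j])
    return path[::-1]

# Middle stop has two homonymous candidates; the nearby one is chosen.
stops = [[(0, 0)], [(1, 0), (5, 5)], [(2, 0)]]
```

This captures why a far-flung homonym gets rejected: it would inflate the total travel distance of the reconstructed itinerary.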
Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic
Emad Mohamed, Z. Sayyed
DOI: 10.1145/3322905.3322927 | Published: 2019-05-08
Abstract: While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and its orthography, most effort has focused on Modern Standard Arabic. In this paper we focus on pre-MSA texts. We use the gradient boosting algorithm to train a morphological segmenter on a corpus derived from Al-Manar, a late 19th/early 20th-century magazine that focused on the Arabic and Islamic heritage. Since most of the available Arabic cultural-heritage material suffers from substandard orthography, we have also trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization achieves an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
Citations: 1
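Segmenters of this kind typically cast the task as per-character boundary classification over a window of surrounding characters. A sketch of that framing, with a trivial stub rule in place of the paper's trained gradient-boosting model, on a romanized toy word (the transliteration, window width and rule are all illustrative assumptions):

```python
def char_windows(word, k=2):
    """Yield a fixed-width character window around each position; these
    windows are the features a boundary classifier (gradient boosting
    in the paper) would be trained on. '#' pads the word edges."""
    padded = "#" * k + word + "#" * k
    return [padded[i:i + 2 * k + 1] for i in range(len(word))]

def segment(word, is_boundary):
    """Split `word` before every position the classifier flags."""
    pieces, start = [], 0
    for i, window in enumerate(char_windows(word)):
        if i > 0 and is_boundary(window):
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

# Stub "classifier": split off a word-initial "w" (the Arabic
# conjunction wa- in a toy transliteration) - illustration only.
rule = lambda window: window[:2] == "#w"
```

Stemming then falls out as a by-product: once prefixes and suffixes are split off as separate pieces, the remaining piece is the stem.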