Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project

Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. Pub Date: 2019-05-08. DOI: 10.1145/3322905.3322923

K. Depuydt, H. Brugman


Citations: 1

Abstract

In this paper, we argue that exploiting historical corpus data requires text metadata that the metadata accompanying digital objects from digital libraries, archives or other electronic text collections does not provide. Most text collections describe in their metadata the object (book, newspaper) containing the text. To study the style of an author, the language of a certain time period, or a phenomenon through time, correct metadata is needed for each word in the text, which leads to a very intricate metadata scheme for some text collections. We focus on the Nederlab corpus. Nederlab is a research environment that gives access to a large diachronic corpus of more than 10 billion words of Dutch texts from the 6th to the 21st century. The corpus has been compiled from existing digitised text material provided by researchers, research organisations, archives and libraries. The aim of Nederlab is to provide tools and data that enable researchers to trace long-term changes in Dutch language, culture and society. This type of research places high demands on the metadata accompanying the texts. Since the Nederlab corpus consists of different collections, each with its own metadata, adding the appropriate metadata was not straightforward, all the more so because content providers and corpus builders have different perspectives. We describe the desired metadata scheme and how we tried to realize it for a corpus of the size of Nederlab.
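
The distinction between object metadata and text metadata can be illustrated with a minimal sketch. It is not the Nederlab metadata scheme: the class names, the fields (`author`, `year_from`, `year_to`) and the toy anthology are all assumptions made up for illustration. The sketch only shows why metadata attached to the containing object is not enough when a query needs to know, for each word, which text it belongs to and when that text was written.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ObjectMetadata:
    """Describes the container (book, newspaper), as a catalogue record would."""
    title: str
    publication_year: int


@dataclass(frozen=True)
class TextMetadata:
    """Describes one text inside the container. All fields are hypothetical."""
    author: Optional[str]
    year_from: int  # earliest plausible year of writing
    year_to: int    # latest plausible year of writing


@dataclass(frozen=True)
class Span:
    """A run of tokens [start, end) that shares one TextMetadata record."""
    start: int
    end: int
    meta: TextMetadata


def metadata_for_token(spans: List[Span], index: int) -> Optional[TextMetadata]:
    """Return the text-level metadata that applies to the token at `index`."""
    for span in spans:
        if span.start <= index < span.end:
            return span.meta
    return None


def tokens_in_period(tokens: List[str], spans: List[Span],
                     year_from: int, year_to: int) -> List[str]:
    """Keep tokens whose text-level dating lies entirely within the period."""
    selected = []
    for i, token in enumerate(tokens):
        meta = metadata_for_token(spans, i)
        if meta and year_from <= meta.year_from and meta.year_to <= year_to:
            selected.append(token)
    return selected


# Invented toy data: an anthology published in 1900 that contains an older
# poem followed by a modern editorial introduction.
anthology = ObjectMetadata(title="Toy anthology", publication_year=1900)
tokens = ["het", "oude", "gedicht", "een", "nieuwe", "inleiding"]
spans = [
    Span(0, 3, TextMetadata(author="Poet A", year_from=1650, year_to=1650)),
    Span(3, 6, TextMetadata(author="Editor B", year_from=1900, year_to=1900)),
]

seventeenth_century = tokens_in_period(tokens, spans, 1600, 1700)
```

Dating every token by `anthology.publication_year` would place the poem in 1900. With span-level records, a query for the 17th century returns only the poem's tokens. Storing one record per span rather than per word is a design choice of this sketch that keeps the per-word lookup possible without repeating the metadata on every token.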
