Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Impact factor: 2.0; JCR: Q3 (Computer Science, Information Systems)

Journal: Data; Publication date: 2023-11-10; DOI: 10.3390/data8110170

Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer


Citations: 0



Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
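The vocabulary-growth case study described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the scripts that accompany the dataset: the function name `vocabulary_growth` and the toy fold contents are assumptions made for illustration, and each fold is assumed to be already loaded as a mapping from word form to frequency. The sketch merges folds one at a time and records the vocabulary size and the number of hapax legomena (types with a cumulative frequency of exactly 1) after each step.

```python
from collections import Counter
from typing import Iterable


def vocabulary_growth(folds: Iterable[Counter]) -> list[tuple[int, int, int]]:
    """Return (folds_included, vocabulary_size, hapax_count) after each fold is merged."""
    merged: Counter = Counter()
    growth = []
    for n, fold in enumerate(folds, start=1):
        # Counter.update adds the fold's frequencies to the running totals.
        merged.update(fold)
        # A hapax legomenon is a type whose cumulative frequency is exactly 1.
        hapaxes = sum(1 for freq in merged.values() if freq == 1)
        growth.append((n, len(merged), hapaxes))
    return growth


# Invented toy unigram counts standing in for three folds (not real DeReKoGram data).
toy_folds = [
    Counter({"der": 5, "Haus": 1, "gehen": 2}),
    Counter({"der": 4, "Haus": 1, "Baum": 1}),
    Counter({"der": 6, "laufen": 1}),
]

for n, vocab, hapax in vocabulary_growth(toy_folds):
    print(f"folds={n} vocabulary={vocab} hapax={hapax}")
```

Note that a type counted as a hapax in one fold ("Haus" in the first toy fold) stops being one once another fold contributes a second occurrence, which is why the hapax count has to be recomputed on the merged counts rather than summed across folds.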


Source journal

Data (Decision Sciences: Information Systems and Management)

CiteScore: 4.30
Self-citation rate: 3.80%
Articles published: not listed
Review time: 10 weeks