Slovenian parliamentary corpus siParl

IF 1.8 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Language Resources and Evaluation Pub Date : 2024-06-02 DOI:10.1007/s10579-024-09746-8

Katja Meden, Tomaž Erjavec, Andrej Pančur

{"title":"Slovenian parliamentary corpus siParl","authors":"Katja Meden, Tomaž Erjavec, Andrej Pančur","doi":"10.1007/s10579-024-09746-8","DOIUrl":null,"url":null,"abstract":"<p>Parliamentary debates represent an essential part of democratic discourse and provide insights into various socio-demographic and linguistic phenomena - parliamentary corpora, which contain transcripts of parliamentary debates and extensive metadata, are an important resource for parliamentary discourse analysis and other research areas. This paper presents the Slovenian parliamentary corpus siParl, the latest version of which contains transcripts of plenary sessions and other legislative bodies of the Assembly of the Republic of Slovenia from 1990 to 2022, comprising more than 1 million speeches and 210 million words. We outline the development history of the corpus and also mention other initiatives that have been influenced by siParl (such as the Parla-CLARIN encoding and the ParlaMint corpora of European parliaments), present the corpus creation process, ranging from the initial data collection to the structural development and encoding of the corpus, and given the growing influence of the ParlaMint corpora, compare siParl with the Slovenian ParlaMint-SI corpus. Finally, we discuss updates for the next version as well as the long-term development and enrichment of the siParl corpus.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"36 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09746-8","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

Parliamentary debates represent an essential part of democratic discourse and provide insights into various socio-demographic and linguistic phenomena - parliamentary corpora, which contain transcripts of parliamentary debates and extensive metadata, are an important resource for parliamentary discourse analysis and other research areas. This paper presents the Slovenian parliamentary corpus siParl, the latest version of which contains transcripts of plenary sessions and other legislative bodies of the Assembly of the Republic of Slovenia from 1990 to 2022, comprising more than 1 million speeches and 210 million words. We outline the development history of the corpus and also mention other initiatives that have been influenced by siParl (such as the Parla-CLARIN encoding and the ParlaMint corpora of European parliaments), present the corpus creation process, ranging from the initial data collection to the structural development and encoding of the corpus, and given the growing influence of the ParlaMint corpora, compare siParl with the Slovenian ParlaMint-SI corpus. Finally, we discuss updates for the next version as well as the long-term development and enrichment of the siParl corpus.

Abstract Image

查看原文本刊更多论文

斯洛文尼亚议会语料库 siParl

议会辩论是民主话语的重要组成部分，可以深入了解各种社会人口和语言现象--议会语料库包含议会辩论的文字记录和大量元数据，是议会话语分析和其他研究领域的重要资源。本文介绍了斯洛文尼亚议会语料库 siParl，其最新版本包含 1990 年至 2022 年斯洛文尼亚共和国议会全体会议和其他立法机构会议的记录誊本，包含 100 多万份发言稿和 2.1 亿个单词。我们概述了该语料库的发展历史，还提到了受 siParl 影响的其他倡议（如 Parla-CLARIN 编码和欧洲议会 ParlaMint 语料库），介绍了从最初的数据收集到语料库的结构发展和编码的语料库创建过程，并鉴于 ParlaMint 语料库日益增长的影响力，将 siParl 与斯洛文尼亚 ParlaMint-SI 语料库进行了比较。最后，我们讨论了 siParl 语料库下一版的更新以及长期发展和充实。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.