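A Corpus of Seimas Session Transcripts for Authorship Attribution and Author Profiling Research (original title: Seimo posėdžių stenogramų tekstynas autorystės nustatymo bei autoriaus profilio sudarymo tyrimams)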

Kalbotyra. Publication date: 2016-03-30. DOI: 10.15388/KLBT.2014.7674
Jurgita Kapočiūtė-Dzikienė, Andrius Utka, Ligita Šarkutė
Kalbotyra, volume 66, pages 27-45. Journal article.
Citation count: 2

Abstract

We present a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to a range of authorship identification tasks. The corpus contains approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session over 7 parliamentary terms, from March 10, 1990 to December 23, 2013. The texts are grouped into 147 categories, one per individual author, so they can be used for authorship attribution; they are also grouped by age, gender and political views, which makes them suitable for author profiling. Because short texts make an author's speaking style harder to recognize and easier to confuse with the style of other authors, only texts of at least 100 words were included. To make each category as comprehensive and representative as possible, only authors who delivered at least 200 speeches were included. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We demonstrate that the corpus can be used effectively for authorship attribution and author profiling with supervised machine learning methods. Its structure also allows it to be used with unsupervised machine learning methods, for building rule-based methods, and in various linguistic analyses.
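The two inclusion thresholds and the character n-gram tokenization mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `(author, text)` input layout, the function names, counting words by whitespace splitting, and applying the length filter before counting speeches per author are all assumptions made for illustration.

```python
from collections import Counter

MIN_WORDS = 100      # minimum text length stated in the abstract
MIN_SPEECHES = 200   # minimum speeches per author stated in the abstract


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Keep long-enough speeches by sufficiently prolific authors.

    `speeches` is a list of (author, text) pairs (hypothetical layout).
    Words are counted by whitespace splitting, and the length filter is
    applied before counting speeches per author; both are assumptions.
    """
    long_enough = [(a, t) for a, t in speeches if len(t.split()) >= min_words]
    counts = Counter(a for a, _ in long_enough)
    return [(a, t) for a, t in long_enough if counts[a] >= min_speeches]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, `char_ngrams("Seimas", 3)` returns `['Sei', 'eim', 'ima', 'mas']`, and a text shorter than `n` characters yields an empty list. Lower thresholds can be passed to `filter_corpus` to try it on a small sample.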