Automatic Identification of Authors' Stylistics and Gender on the Basis of the Corpus of Russian Fiction Using Extended Set-theoretic Model with Collocation Extraction

IF 0.2 Q4 LINGUISTICS
Glottometrics, Pub Date: 2021-05-01, DOI: 10.53482/2021_50_389
Alexandr Osochkin, X. Piotrowska, Vladimir Fomin
Citations: 1

Abstract

We present a novel quantitative approach to classifying authors' style and gender differences based on the extraction of word collocations. The proposed algorithm mitigates previously described problems of text processing with vector models. We demonstrate the approach by analyzing a corpus of Russian prose. We discuss different approaches to classifying and identifying an author's style, as implemented in currently available software solutions and libraries for morphological analysis, parameterization methods, text indexing, artificial intelligence algorithms and knowledge extraction. Our results demonstrate the efficiency and relative advantage of regression decision tree methods in identifying informative frequency indexes in a way that lends itself to logical interpretation. We develop a toolkit for conducting comparative experiments that assess the effectiveness of classifying natural-language text data using three models of text representation: the vector model, the set-theoretic model, and the authors' set-theoretic model with collocation extraction. Comparing the ability of different methods to identify the style and gender differences of authors of fiction, we find that the proposed approach, which incorporates collocation information, alleviates some of the previously identified deficiencies and yields overall improvements in classification accuracy.
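
To make the idea of a set-theoretic text representation extended with collocations more concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes that collocations are adjacent word pairs scored by pointwise mutual information (PMI), and that texts are compared by Jaccard similarity over sets of words plus collocations. The function names `tokenize`, `collocations`, `feature_set` and `jaccard`, and the thresholds `min_count` and `min_pmi`, are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Illustrative tokenizer: lowercase, split on whitespace,
    # strip common punctuation from both ends of each token.
    tokens = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t]


def collocations(tokens, min_count=2, min_pmi=0.0):
    # Assumption: a collocation is an adjacent word pair that occurs
    # at least min_count times and has PMI above min_pmi.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    found = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log2(p_pair / (p_a * p_b)) > min_pmi:
            found.add((a, b))
    return found


def feature_set(text):
    # Set-style representation: distinct words plus extracted collocations.
    tokens = tokenize(text)
    return set(tokens) | collocations(tokens)


def jaccard(a, b):
    # Jaccard similarity of two feature sets; 0.0 when both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    text_a = "white night white night dark river"
    text_b = "white night black river"
    print(sorted(collocations(tokenize(text_a))))
    print(jaccard(feature_set(text_a), feature_set(text_b)))
```

In a sketch like this, word order survives only inside the extracted pairs, which is one simple way a set-based representation can carry collocation information that a plain bag of words discards. A real experiment on Russian prose would also need morphological normalization, which is omitted here.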
Source journal
Glottometrics (Linguistics)
CiteScore: 0.50
Self-citation rate: 0.00%
Articles published: 0
About the journal: The aim of Glottometrics is the quantification, measurement and mathematical modeling of any kind of language phenomena. The journal invites contributions on probabilistic or other mathematical models (e.g. graph-theoretic or optimization approaches) that make it possible to establish language laws that can be validated by testing statistical hypotheses.