Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm

IF 0.5 Q4 AUTOMATION & CONTROL SYSTEMS
K. V. Lagutina
Automatic Control and Computer Sciences, vol. 57, no. 7, pp. 817-827. Published 2024-02-27.
DOI: 10.3103/S0146411623070076
https://link.springer.com/article/10.3103/S0146411623070076
Citations: 0

Abstract

This article investigates modern vector text models for classifying Russian-language texts by genre. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and statistical analysis of the rhythmic characteristics made it possible to identify both the genres that are most diverse in terms of rhythm (novels and reviews) and the least diverse genre (scientific articles). These are the genres that are subsequently classified best using rhythm and an LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings separates one genre from another with a small number of errors. The multiclass F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythmic characteristics when applied to genre classification.
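
The abstract reports a multiclass F-measure but does not say how it is averaged across the five genres. As an illustration only, the sketch below computes a macro-averaged F-measure (the unweighted mean of per-genre F1 scores), which is one common choice. The genre label strings and the function name `macro_f1` are hypothetical and do not come from the article.

```python
from collections import Counter

# Hypothetical label names for the five genres listed in the abstract.
GENRES = ["novel", "scientific_article", "review", "vk_post", "news"]


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of per-class F1 scores.

    Assumption: macro averaging. The article may use a different scheme.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        scores.append(2 * precision * recall / pr_sum if pr_sum else 0.0)
    return sum(scores) / len(scores)
```

With macro averaging every genre contributes equally to the score, so a weakly classified genre lowers the result even when the other four are classified almost perfectly.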

Source journal: AUTOMATIC CONTROL AND COMPUTER SCIENCES (AUTOMATION & CONTROL SYSTEMS)
CiteScore: 1.70
Self-citation rate: 22.20%
Articles published: 47
About the journal: Automatic Control and Computer Sciences is a peer-reviewed journal that publishes articles on:
• Control systems, cyber-physical systems, real-time systems, robotics, smart sensors, embedded intelligence
• Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management
• Signal and image processing, machine learning, machine perception, computer vision