{"title":"基于现代嵌入和节奏的俄语文本体裁分类","authors":"K. V. Lagutina","doi":"10.3103/S0146411623070076","DOIUrl":null,"url":null,"abstract":"<p>This article investigates modern vector text models for solving the problem of genre classifying Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments have been carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and analysis of statistics for rhythmic characteristics have made it possible to distinguish both the most diverse genres in terms of rhythm (novels and reviews) and the least (scientific articles). It is these genres that are subsequently classified best using rhythm and the LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclassification F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in the tasks of computational linguistics and highlights the advantages and limitations of the set rhythmic characteristics on the genre classification material.</p>","PeriodicalId":46238,"journal":{"name":"AUTOMATIC CONTROL AND COMPUTER SCIENCES","volume":"57 7","pages":"817 - 827"},"PeriodicalIF":0.5000,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm\",\"authors\":\"K. V. Lagutina\",\"doi\":\"10.3103/S0146411623070076\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>This article investigates modern vector text models for solving the problem of genre classifying Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments have been carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and analysis of statistics for rhythmic characteristics have made it possible to distinguish both the most diverse genres in terms of rhythm (novels and reviews) and the least (scientific articles). It is these genres that are subsequently classified best using rhythm and the LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclassification F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in the tasks of computational linguistics and highlights the advantages and limitations of the set rhythmic characteristics on the genre classification material.</p>\",\"PeriodicalId\":46238,\"journal\":{\"name\":\"AUTOMATIC CONTROL AND COMPUTER SCIENCES\",\"volume\":\"57 7\",\"pages\":\"817 - 827\"},\"PeriodicalIF\":0.5000,\"publicationDate\":\"2024-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AUTOMATIC CONTROL AND COMPUTER SCIENCES\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://link.springer.com/article/10.3103/S0146411623070076\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AUTOMATIC CONTROL AND COMPUTER SCIENCES","FirstCategoryId":"1085","ListUrlMain":"https://link.springer.com/article/10.3103/S0146411623070076","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm
This article investigates modern vector text models for solving the problem of genre classifying Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments have been carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and analysis of statistics for rhythmic characteristics have made it possible to distinguish both the most diverse genres in terms of rhythm (novels and reviews) and the least (scientific articles). It is these genres that are subsequently classified best using rhythm and the LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclassification F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in the tasks of computational linguistics and highlights the advantages and limitations of the set rhythmic characteristics on the genre classification material.
期刊介绍:
Automatic Control and Computer Sciences is a peer reviewed journal that publishes articles on• Control systems, cyber-physical system, real-time systems, robotics, smart sensors, embedded intelligence • Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management • Signal and image processing, machine learning, machine perception, computer vision