The Russian Language Corpus and a Neural Network to Analyse Internet Tweet Reports About Covid-19

Proceedings of The 5th International Workshop on Deep Learning in Computational Physics — PoS(DLCP2021). Pub Date: 2021-12-01. DOI: 10.22323/1.410.0017

A. Sboev, I. Moloshnikov, A. Naumov, Anastasia Levochkina, R. Rybka


Citations: 2

Abstract

This work aims to create a tool that filters messages from Twitter users by whether they mention coronavirus disease. For this purpose, a corpus of Russian-language tweets was created. It consists of a labeled part of 10 thousand tweets, manually divided into several classes with different levels of confidence ("potentially have covid", "have covid now", and "other cases"), and an unlabeled part of 2 million tweets on the topic of the pandemic. The paper describes how the corpus was built, from data collection through preliminary filtering to annotation according to whether a tweet describes the disease. Machine learning methods were compared on the tweet classification task. A model based on the XLM-RoBERTa architecture, additionally trained on the corpus of unlabeled tweets, achieves an F1 score of 0.85 on the binary classification task ("potentially have covid" and "have covid now" vs. "other"). This is 12% higher than the simplest model, which uses TF-IDF encoding and an SVM classifier, and 5% higher than the RuDR-BERT-based model. The resulting toolkit will expand the feature space of models for predicting the spread of coronavirus infection and other pandemics by adding the dynamics of discussion on social networks, which characterizes people's attitudes towards the infection. © Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
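As an illustration of the binary task described in the abstract, the sketch below merges the two covid-related classes into one positive class and computes an F1 score. It is a minimal example under stated assumptions, not the authors' evaluation code: the helper names `to_binary` and `f1_score` are hypothetical, the toy labels are invented, and the abstract does not say which F1 variant was reported (positive-class F1 is assumed here).

```python
# Illustrative sketch only. Class names follow the abstract; the helper names
# and the toy data are assumptions made for this example.
POSITIVE_CLASSES = {"potentially have covid", "have covid now"}


def to_binary(label: str) -> int:
    """Map a three-class annotation to the binary task: 1 = covid-related, 0 = other."""
    return 1 if label in POSITIVE_CLASSES else 0


def f1_score(y_true, y_pred) -> float:
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented gold annotations and model predictions for five tweets.
gold = ["have covid now", "other cases", "potentially have covid",
        "other cases", "have covid now"]
pred = ["have covid now", "have covid now", "other cases",
        "other cases", "potentially have covid"]

y_true = [to_binary(label) for label in gold]
y_pred = [to_binary(label) for label in pred]
score = f1_score(y_true, y_pred)
print(f"binary F1 = {score:.3f}")
```

Note that the last toy tweet counts as correct in the binary setting even though the fine-grained classes differ, which is the effect of merging "potentially have covid" and "have covid now" into a single class.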
