Ukrainian-Language Tweet Analysis Technology for Predicting Changes in Public Opinion Dynamics Based on Machine Learning

IF 0.2 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Удк, Технологія Аналізу Україномовних, Твітів Для, Прогнозування Зміни, Динаміки Громадської, НА Думки, Основі Машинного Навчання
Journal: Radio Electronics Computer Science Control
Publication date: 2023-06-30
Publication type: Journal Article
DOI: https://doi.org/10.15588/1607-3274-2023-2-11
Citations: 0

Abstract

UKRAINIAN LANGUAGE TWEETS ANALYSIS TECHNOLOGY FOR PUBLIC OPINION DYNAMICS CHANGE PREDICTION BASED ON MACHINE LEARNING
Context. Automating public opinion research not only reduces the amount of manual work but also yields time slices of the results without additional effort. Since direct interaction with respondents should be avoided, public opinion should be analyzed from the sources where it is freely expressed. Social networks suit this role well, because people freely publish their thoughts there or react with genuine emotion to published information about events. Statistics show that social network data alone is not enough for a complete result, because a significant percentage of people do not use social networks. Even so, automating the study of this stratum of the population is already a useful result for analyzing how public opinion changes in response to events in the country and the world and, accordingly, for adjusting public administration processes in the future.
Objective. The objective of the study is to develop a technology for analyzing the flow of Ukrainian-language content in social networks for public opinion research, based on finding clustered thematic groups of tweets.
Method. The article develops a clustering-based technology for finding tweet trends. It produces a data stream of short cluster representations and their popularity for further public opinion research. An effective approach to tweet collection, filtering, cleaning and pre-processing is described, based on a comparative analysis of the Bag of Words, TF-IDF and BERT algorithms. The impact of stemming and lemmatization on the quality of the resulting clusters was determined. Optimal combinations of clustering methods (K-Means, Agglomerative Hierarchical Clustering and HDBSCAN) and tweet vectorization methods were found by analyzing 27 clusterings of one data sample. A method for presenting tweet clusters in a short format was selected.
Results. Algorithms that use the Levenshtein distance, i.e. fuzz sort, fuzz set and levenshtein, showed the best results. These algorithms perform checks quickly and show a larger gap between similarity scores, which makes it possible to set the similarity threshold more accurately. According to the clustering results, the HDBSCAN clustering algorithm with BERT vectorization gives the most accurate results, while K-Means with TF-IDF gives the best speed while still producing an optimal result. Stemming can be used to reduce execution time.
Conclusions. The study experimentally identified the best options for comparing cluster fingerprints among the following similarity search methods: Fuzz Sort, Fuzz Set, Levenshtein, Jaro Winkler, Jaccard, Sorensen, Cosine and Sift4. For some algorithms, the average fingerprint similarity exceeds 70%. Three tools proved effective for comparing fingerprints, as they show a sufficient difference (> 20%) between comparisons of similar clusters and comparisons of different clusters. Using the selected methods, trend analysis was successfully performed on 90,000 tweets over 7 days for 5 topics of the week, with K-Means and TF-IDF for clustering and vectorization and fuzz sort for cluster fingerprint comparison at a 55% similarity threshold.
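
The fingerprint comparison step described in the abstract can be illustrated with a small sketch. It assumes that "fuzz sort" denotes a token-sort comparison (tokens are lowercased and sorted before a Levenshtein-based similarity is computed) and that a cluster fingerprint is a short string of keywords; the abstract confirms neither detail, and fuzzy-matching libraries may normalize the score differently. The function names and the sample fingerprints are hypothetical. Only the 55% threshold comes from the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in percent after lowercasing and sorting whitespace tokens.

    Assumption: normalized as 1 - distance / length of the longer string.
    """
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    longest = max(len(sorted_a), len(sorted_b))
    if longest == 0:
        return 100.0  # two empty fingerprints are treated as identical
    return 100.0 * (1 - levenshtein(sorted_a, sorted_b) / longest)


def same_trend(fingerprint_a: str, fingerprint_b: str, threshold: float = 55.0) -> bool:
    """Treat two cluster fingerprints as one trend if similarity reaches the threshold."""
    return token_sort_similarity(fingerprint_a, fingerprint_b) >= threshold


if __name__ == "__main__":
    # Hypothetical fingerprints of clusters found on two consecutive days.
    day_1 = "energy prices rise"
    day_2 = "rise energy prices"
    day_3 = "football match result"
    print(same_trend(day_1, day_2))  # same words in a different order
    print(same_trend(day_1, day_3))  # unrelated topic
```

Sorting tokens first makes the comparison insensitive to keyword order, which matters when a fingerprint is an unordered set of cluster keywords. The threshold is a tunable parameter; a wider gap between scores for similar and dissimilar clusters makes it easier to choose.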
Source journal: Radio Electronics Computer Science Control (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
Self-citation rate: 20.00%
Articles published: 66
Review time: 12 weeks