Effective classification of natural language texts and determination of speech tonality using selected machine learning methods

Вопросы безопасности, Pub Date: 2022-04-01, DOI: 10.25136/2409-7543.2022.4.38658

E. Pleshakova, S. T. Gataullin, A. V. Osipov, E. V. Romanova, Nikolai Sergeevich Samburov



Abstract

An enormous volume of text is now being generated, and there is an urgent need to organize it into a structure that supports classification and the correct definition of categories. The authors examine in detail the classification of natural language texts and the determination of the tonality (sentiment) of texts on the social network Twitter. Alongside its many advantages, the use of social networks has a negative side: users face numerous cyber threats, such as personal data leakage, cyberbullying, spam, and fake news. The main task of tonality analysis is to determine the emotional intensity and coloring of a text, which makes it possible to detect negatively colored speech. Emotional coloring and mood are purely individual traits and therefore have potential as identification tools. The main purpose of natural language text classification is to extract information from text and use it in processes such as search and classification with machine learning methods. The authors selected and compared the following models: logistic regression, multilayer perceptron, random forest, naive Bayes, k-nearest neighbors, decision tree, and stochastic gradient descent. These methods were then tested and compared with one another. The experiments show that TF-IDF weighting for text vectorization does not always improve model quality; in some cases it improves individual metrics while the remaining metrics for that model decrease. Stochastic gradient descent proved to be the best method for the purpose of the work.
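
The comparison described in the abstract (raw term counts versus TF-IDF weighting, fed to a classifier trained by stochastic gradient descent) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the four toy sentences and their labels are invented, the smoothed IDF formula is one common variant, and the classifier is a plain logistic-loss linear model, since the abstract does not state which loss, features, or hyperparameters the authors used.

```python
import math
from collections import Counter

# Hypothetical toy corpus (NOT data from the paper).
docs = [
    "i love this great phone",
    "i hate this awful phone",
    "great service i love it",
    "awful service i hate it",
]
labels = [1, 0, 1, 0]  # 1 = positive tone, 0 = negative tone


def tokenize(text):
    return text.lower().split()


def count_vector(doc, vocab):
    """Raw term-frequency vector over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [float(counts[term]) for term in vocab]


def idf_weights(corpus, vocab):
    """Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(tokenize(doc)))
    return [math.log((1 + n) / (1 + df[term])) + 1.0 for term in vocab]


def tfidf_vector(doc, vocab, idf):
    """Term counts scaled by IDF, then L2-normalised."""
    raw = [c * w for c, w in zip(count_vector(doc, vocab), idf)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def train_sgd_logreg(X, y, epochs=200, lr=0.5):
    """Logistic regression fitted with per-example gradient steps.

    Examples are visited in a fixed order to keep the run deterministic;
    a real SGD setup would normally shuffle them each epoch.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - target  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(X, y, w, b):
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
idf = idf_weights(docs, vocab)

X_counts = [count_vector(d, vocab) for d in docs]
X_tfidf = [tfidf_vector(d, vocab, idf) for d in docs]

acc_counts = accuracy(X_counts, labels, *train_sgd_logreg(X_counts, labels))
acc_tfidf = accuracy(X_tfidf, labels, *train_sgd_logreg(X_tfidf, labels))

print("training accuracy, raw counts:", acc_counts)
print("training accuracy, TF-IDF:   ", acc_tfidf)
```

On a corpus this small both representations separate the classes, so the sketch only shows the mechanics; whether TF-IDF helps or hurts a given metric is exactly what has to be measured on held-out data for each model.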
