IMPROVING THE QUALITY OF SPAM DETECTION OF COMMENTS USING SENTIMENT ANALYSIS WITH MACHINE LEARNING

Computer systems and information technologies. Pub Date: 2023-03-30. DOI: 10.31891/csit-2023-1-6

Oleksandr Iermolaiev, I. Kulakovska



Abstract

People spend more and more time on the Internet and visit a wide range of sites. Many of these sites host comments that help people make decisions. For example, many visitors to an online store check a product's reviews before buying, and users of video hosting services read the comments before watching a video. However, not all comments are equally useful: many are spam that carries no useful information. The number of spam comments rose especially sharply during the full-scale invasion, when the enemy uses bots to sow panic and flood the Internet with spam. Such comments very often have a different emotional tone from ordinary ones, so it makes sense to use tonality analysis to detect them. The aim of the study is to improve the quality of spam detection by performing sentiment analysis (determining the tonality) of comments using machine learning. An LSTM neural network and a dataset were selected for this purpose, and three metrics for evaluating the quality of the neural network were described. The original dataset was analyzed and split into training, validation, and test sets. The neural network was trained on the Google Colab platform using GPUs. The trained network rates the tonality of a comment on a scale from 1 to 5, where a higher score means a more emotionally positive text and a lower score a more negative one. On the test set the network achieved an accuracy of 76.3% and an RMSE (root mean squared error) of 0.6478, so the typical error is less than one class. A Naive Bayes classifier without tonality analysis reached an accuracy of 88.3%; with the text tonality parameter, the accuracy increased to 93.1%. A Random Forest classifier without tonality analysis reached an accuracy of 90.8%; with the text tonality parameter, the accuracy increased to 95.7%. Adding the tonality parameter therefore increased the accuracy of both models, by 4.8 percentage points for the Naive Bayes classifier and by 4.9 percentage points for the Random Forest.
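The two evaluation figures quoted in the abstract, accuracy and RMSE on the 1 to 5 tonality scale, can be illustrated with a minimal sketch. The abstract does not show the authors' evaluation code, so the function names (`accuracy`, `rmse`, `gain_in_points`) and the sample labels below are assumptions made purely for illustration; only the four accuracy percentages passed to `gain_in_points` come from the abstract.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def gain_in_points(before_pct, after_pct):
    """Accuracy gain in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)


# Hypothetical tonality labels on the 1-5 scale (not data from the study).
y_true = [5, 4, 1, 3, 2, 5, 1, 4]
y_pred = [5, 3, 1, 3, 2, 4, 2, 4]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"RMSE     = {rmse(y_true, y_pred):.4f}")

# Accuracy figures reported in the abstract, without and with tonality.
print("Naive Bayes gain:", gain_in_points(88.3, 93.1), "points")
print("Random Forest gain:", gain_in_points(90.8, 95.7), "points")
```

In the toy sample every wrong prediction is off by exactly one class, so the RMSE stays below 1 even though three of eight predictions miss. This is the sense in which an RMSE under 1 indicates a typical error of less than one class.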


