Experimental Evaluation: Can Humans Recognise Social Media Bots?

Big Data and Cognitive Computing. Pub Date: 2024-02-26. DOI: 10.3390/bdcc8030024

M. Kolomeets, O. Tushkanova, Vasily Desnitsky, L. Vitkova, Andrey Chechulin



Abstract

This paper tests the hypothesis that social media bot detection systems based on supervised machine learning may not be as accurate as researchers claim: bots have become increasingly sophisticated, making it difficult for human annotators to detect them better than random selection. As a result, obtaining a ground-truth dataset through human annotation is not possible, and supervised machine-learning models inherit the annotation errors. To test this hypothesis, we conducted an experiment in which humans were tasked with recognizing malicious bots on the VKontakte social network. We then compared the “human” answers with the “ground-truth” bot labels (‘a bot’/‘not a bot’). Based on the experiment, we evaluated the bot detection efficiency of annotators in three scenarios that are typical for cybersecurity but differ in detection difficulty: (1) detection among random accounts, (2) detection among accounts of a social network ‘community’, and (3) detection among verified accounts. The study showed that humans could only detect simple bots in all three scenarios but could not detect more sophisticated ones (p-value = 0.05). The study also evaluates the limits of hypothetical and existing bot detection systems that leverage non-expert-labelled datasets: the balanced accuracy of such systems can drop to 0.5 or lower, depending on bot complexity and detection scenario. The paper also describes the experiment design, collected datasets, statistical evaluation, and machine learning accuracy measures applied to support the results. In the discussion, we raise the question of using human labelling in bot detection systems and its potential cybersecurity issues. We also provide open access on GitHub to the datasets used, the experiment results, and the software code for evaluating the statistical and machine learning accuracy metrics used in this paper.
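To make the 0.5 figure concrete, here is a minimal sketch of balanced accuracy, assuming the standard definition (the mean of the true positive rate and the true negative rate). The function names and the toy labels are hypothetical illustrations; they are not taken from the authors' GitHub code or datasets.

```python
def accuracy(truth, answers):
    """Share of answers that match the ground-truth labels."""
    return sum(1 for t, a in zip(truth, answers) if t == a) / len(truth)


def balanced_accuracy(truth, answers):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, a in zip(truth, answers) if t and a)
    tn = sum(1 for t, a in zip(truth, answers) if not t and not a)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in the ground truth")
    return 0.5 * (tp / positives + tn / negatives)


# Hypothetical toy data: True = 'a bot', False = 'not a bot'.
truth = [True, True, False, False, False, False, False, False]

# An annotator who never recognises a bot and labels every account 'not a bot'.
misses_all_bots = [False] * len(truth)

# An annotator whose answers match the ground truth exactly.
perfect = list(truth)

print(accuracy(truth, misses_all_bots))           # 0.75
print(balanced_accuracy(truth, misses_all_bots))  # 0.5
print(balanced_accuracy(truth, perfect))          # 1.0
```

On imbalanced data, plain accuracy can look respectable (0.75 here) even when no bot is detected, whereas balanced accuracy falls to 0.5, the level of random selection.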


