Human, bot or both? A study on the capabilities of classification models on mixed accounts

2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2021-09-01. DOI: 10.26226/morressier.613b5419842293c031b5b63d

Nathan Cassee, Christos Kitsanelis, Eleni Constantinou, Alexander Serebrenik


Citations: 10

Abstract

Several bot detection algorithms have recently been discussed in the literature, as software bots that perform maintenance tasks have become more popular in recent years. State-of-the-art techniques detect bots based on a binary classification, where a GitHub account is either a human or a bot. However, this conceptualisation of bot detection as an account-level binary classification problem fails to account for ‘mixed accounts’: accounts that are shared between a human and a bot, and that therefore exhibit both bot and human activity. By using binary classification models for bot detection, researchers might hence mischaracterize both human and bot behavior in software maintenance. This calls for a conceptualisation of bot detection as comment-level classification. However, the only such approach investigates just a small number of mixed-account comments. The nature of mixed accounts on GitHub is thus still unknown, and the absence of appropriate datasets makes this a difficult problem to study. In this paper, we investigate three comment-level classification models and evaluate these classifiers on a manually labeled dataset of mixed accounts. We find that the best classifiers based on these classification models achieve precision and recall between 88% and 96%. However, even the most accurate comment-level classifier cannot accurately detect mixed accounts; rather, we find that textual content alone, or textual content combined with templates used by bots, are very effective features for the detection of both bot and mixed accounts. Our study calls for more accurate bot detection techniques capable of identifying mixed accounts, and as such supporting more refined insights into software maintenance activities performed by humans and bots on social coding sites.
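To make the account-level versus comment-level distinction concrete, the following is a minimal sketch, not the authors' method. It assumes each comment has already been labeled 'bot' or 'human' by some comment-level classifier, and it aggregates those labels into a three-way account label. The function name `account_label` and the threshold `min_share` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def account_label(comment_labels, min_share=0.1):
    """Aggregate per-comment 'bot'/'human' labels into an account label.

    Illustrative assumption: an account is 'mixed' only when both classes
    account for more than `min_share` of its comments. `min_share` is a
    hypothetical threshold, not a value taken from the paper.
    """
    if not comment_labels:
        raise ValueError("need at least one comment label")
    counts = Counter(comment_labels)
    bot_share = counts["bot"] / len(comment_labels)
    if bot_share >= 1 - min_share:
        return "bot"      # (almost) all comments look automated
    if bot_share <= min_share:
        return "human"    # (almost) all comments look human-written
    return "mixed"        # substantial activity of both kinds


if __name__ == "__main__":
    print(account_label(["bot"] * 10))
    print(account_label(["human"] * 19 + ["bot"]))
    print(account_label(["bot"] * 5 + ["human"] * 5))
```

A binary account-level classifier would have to force the third case into 'bot' or 'human'. As the abstract notes, even accurate comment-level classifiers do not reliably identify mixed accounts in practice, so a simple aggregation of this kind should be read as an illustration of the problem rather than a solution.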

