Identifying bot activity in GitHub pull request and issue comments

M. Golzadeh, Alexandre Decan, Eleni Constantinou, T. Mens
Published in: 2021 IEEE/ACM Third International Workshop on Bots in Software Engineering (BotSE)
Publication date: 2021-03-10
DOI: 10.1109/BotSE52550.2021.00012 (https://doi.org/10.1109/BotSE52550.2021.00012)
Citations: 13

Abstract

Development bots are used on GitHub to automate repetitive activities. Such bots communicate with human actors via issue comments and pull request (PR) comments. Identifying these bot comments helps prevent bias in socio-technical studies of software development. To automate their identification, we propose a classification model based on natural language processing. Starting from a balanced ground-truth dataset of 19,282 PR and issue comments, we encode the comments as vectors using a combination of the bag-of-words and TF-IDF techniques. We train a range of binary classifiers to predict the type of comment (human or bot) from this vector representation. A multinomial Naive Bayes classifier provides the best results: on a test set containing 50% of the data, it achieves an average precision, recall, and F1 score of 0.88. Although the model shows promising results on pull request and issue comments, further work is required to generalize it to other types of activities, such as commit messages and code reviews.
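The pipeline described in the abstract (bag-of-words counts weighted by TF-IDF, a multinomial Naive Bayes classifier, and precision/recall/F1 on held-out comments) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, the smoothed IDF formula and Laplace smoothing (`alpha`) are assumptions, and every comment in the toy data is invented for illustration. The paper's dataset and its 50% train/test split are not reproduced here.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Bag of words: lowercase, split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def fit_idf(docs):
    # Smoothed inverse document frequency (an assumed variant).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    # Term count times IDF; terms unseen during training are dropped.
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def train_nb(vectors, labels, vocab, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing.
    weight = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for vec, y in zip(vectors, labels):
        for t, w in vec.items():
            weight[y][t] += w
            totals[y] += w
    class_docs = Counter(labels)
    model = {}
    for y in class_docs:
        denom = totals[y] + alpha * len(vocab)
        model[y] = {
            "prior": math.log(class_docs[y] / len(labels)),
            "loglik": {t: math.log((weight[y][t] + alpha) / denom) for t in vocab},
        }
    return model

def predict(model, vec):
    # Pick the class with the highest log posterior (up to a constant).
    def score(y):
        m = model[y]
        return m["prior"] + sum(w * m["loglik"][t] for t, w in vec.items())
    return max(model, key=score)

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented toy comments (hypothetical, not taken from the paper's dataset).
train = [
    ("build passed all checks passed", "bot"),
    ("coverage increased after merging this pull request", "bot"),
    ("this pull request was automatically marked as stale", "bot"),
    ("i think this function name is confusing", "human"),
    ("could you add a test for this case", "human"),
    ("thanks looks good to me", "human"),
]
test = [
    ("build failed some checks passed for this pull request", "bot"),
    ("i think you could add a test", "human"),
]

train_docs = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_docs)
model = train_nb([tfidf(d, idf) for d in train_docs], train_labels, vocab=set(idf))

predictions = [predict(model, tfidf(tokenize(text), idf)) for text, _ in test]
metrics = precision_recall_f1([label for _, label in test], predictions, positive="bot")
print(predictions, metrics)
```

Scores on such a tiny invented sample say nothing about the 0.88 reported in the abstract; the sketch only shows how the stages fit together. In practice, a library such as scikit-learn would typically replace the hand-written vectorizer and classifier.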