Unsupervised [randomly responding] survey bot detection: In search of high classification accuracy.

IF 7.6 · Zone 1 (Psychology) · JCR Q1: PSYCHOLOGY, MULTIDISCIPLINARY
Carl F Falk, Amaris Huang, Michael John Ilagan
Journal: Psychological Methods
Publication date: 2025-03-10
Publication type: Journal Article
DOI: 10.1037/met0000746 (https://doi.org/10.1037/met0000746)
Citation count: 0

Abstract

While online survey data collection has become popular in the social sciences, there is a risk of data contamination by computer-generated random responses (i.e., bots). Bot prevalence poses a significant threat to data quality. If deterrence efforts fail or were not set up in advance, researchers can still attempt to detect bots already present in the data. In this research, we study a recently developed algorithm to detect survey bots. The algorithm requires neither a measurement model nor a sample of known humans and bots; thus, it is model agnostic and unsupervised. It involves a permutation test under the assumption that Likert-type items are exchangeable for bots, but not humans. While the algorithm maintains a desired sensitivity for detecting bots (e.g., 95%), its classification accuracy may depend on other inventory-specific or demographic factors. Generating hypothetical human responses from a well-known item response theory model, we use simulations to understand how classification accuracy is affected by item properties, the number of items, the number of latent factors, and factor correlations. In an additional study, we simulate bots to contaminate real human data from 35 publicly available data sets to understand the algorithm's classification accuracy under a variety of real measurement instruments. Through this work, we identify conditions under which classification accuracy is around 95% or above, but also conditions under which accuracy is quite low. In brief, performance is better with more items, more categories per item, and a variety in the difficulty or means of the survey items. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
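
To make the exchangeability idea in the abstract concrete, here is a minimal sketch of a per-respondent permutation test. It is not the authors' implementation. The test statistic (agreement between a respondent's answers and the sample item means), the function names `permutation_p_value`, `flag_bots`, and `simulate_sample`, and the toy data generator are all assumptions made for illustration. In particular, the simulated humans come from a simple rounded-normal model, not from the item response theory model used in the paper.

```python
import numpy as np


def permutation_p_value(responses, centered_means, n_perm, rng):
    """P-value for H0: this respondent's items are exchangeable (bot-like).

    Assumed statistic: dot product of the response vector with the centered
    item means. It is large when the respondent tracks item difficulty.
    """
    observed = responses @ centered_means
    # Shuffle the respondent's own answers across items, independently per row.
    shuffled = rng.permuted(np.tile(responses, (n_perm, 1)), axis=1)
    permuted = shuffled @ centered_means
    # Small tolerance so floating-point ties count as ties.
    n_extreme = np.sum(permuted >= observed - 1e-12)
    # Add-one correction keeps the p-value valid and strictly positive.
    return (1 + n_extreme) / (n_perm + 1)


def flag_bots(data, sensitivity=0.95, n_perm=999, seed=0):
    """Flag a respondent as a bot unless exchangeability is rejected."""
    rng = np.random.default_rng(seed)
    item_means = data.mean(axis=0)
    centered_means = item_means - item_means.mean()
    p_values = np.array(
        [permutation_p_value(row, centered_means, n_perm, rng) for row in data]
    )
    # Under H0 (a bot), P(p <= alpha) <= alpha, so at least `sensitivity`
    # of bots are expected to stay flagged.
    alpha = 1.0 - sensitivity
    return p_values > alpha, p_values


def simulate_sample(n_humans=200, n_bots=50, n_items=20, n_cats=5, seed=1):
    """Toy data (assumption): humans follow item locations, bots are uniform."""
    rng = np.random.default_rng(seed)
    item_locations = np.linspace(1.5, n_cats - 0.5, n_items)
    person_shift = rng.normal(0.0, 0.5, size=(n_humans, 1))
    noise = rng.normal(0.0, 0.7, size=(n_humans, n_items))
    humans = np.clip(np.rint(item_locations + person_shift + noise), 1, n_cats)
    bots = rng.integers(1, n_cats + 1, size=(n_bots, n_items))
    data = np.vstack([humans, bots]).astype(float)
    is_bot = np.concatenate([np.zeros(n_humans, bool), np.ones(n_bots, bool)])
    return data, is_bot


data, is_bot = simulate_sample()
flags, p_values = flag_bots(data)
sensitivity_hat = flags[is_bot].mean()       # share of bots flagged as bots
specificity_hat = (~flags[~is_bot]).mean()   # share of humans kept as humans
accuracy_hat = (flags == is_bot).mean()
print(f"sensitivity={sensitivity_hat:.2f}, "
      f"specificity={specificity_hat:.2f}, accuracy={accuracy_hat:.2f}")
```

In this sketch the bot sensitivity is fixed by the threshold, so overall accuracy is driven by how often humans reject exchangeability. That in turn depends on how much the item means vary and on how many items and categories there are, which is consistent with the abstract's summary that performance improves with more items, more categories per item, and varied item difficulty.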

Source journal: Psychological Methods (PSYCHOLOGY, MULTIDISCIPLINARY)
CiteScore: 13.10
Self-citation rate: 7.10%
Articles published: 159
Journal description: Psychological Methods is devoted to the development and dissemination of methods for collecting, analyzing, understanding, and interpreting psychological data. Its purpose is the dissemination of innovations in research design, measurement, methodology, and quantitative and qualitative analysis to the psychological community; its further purpose is to promote effective communication about related substantive and methodological issues. The audience is expected to be diverse and to include those who develop new procedures, those who are responsible for undergraduate and graduate training in design, measurement, and statistics, as well as those who employ those procedures in research.