Using aggregated AI detector outcomes to eliminate false positives in STEM-student writing.

IF 1.7 · CAS Tier 4 (Education) · JCR Q2, EDUCATION, SCIENTIFIC DISCIPLINES
Advances in Physiology Education · Pub Date: 2025-06-01 · Epub Date: 2025-03-19 · DOI: 10.1152/advan.00235.2024 · Pages: 486-495
Jon-Philippe K Hyatt, Elisa Jayne Bienenstock, Carla M Firetto, Elizabeth R Woods, Robert C Comus
Cited by: 0

Abstract

Generative artificial intelligence (AI) large language models have become sufficiently accessible and user-friendly to assist students with course work, studying tactics, and written communication. AI-generated writing is almost indistinguishable from human-derived work. Instructors must rely on intuition/experience and, recently, assistance from online AI detectors to help them distinguish between student- and AI-written material. Here, we tested the veracity of AI detectors for writing samples from a fact-heavy, lower-division undergraduate anatomy and physiology course. Student participants (n = 190) completed three parts: a hand-written essay answering a prompt on the structure/function of the plasma membrane; creating an AI-generated answer to the same prompt; and a survey seeking participants' views on the quality of each essay as well as general AI use. Randomly selected (n = 50) participant-written and AI-generated essays were blindly uploaded onto four AI detectors; a separate and unique group of randomly selected essays (n = 48) was provided to human raters (n = 9) for classification assessment. For the majority of essays, human raters and the best-performing AI detectors (n = 3) similarly identified their correct origin (84-95% and 93-98%, respectively) (P > 0.05). Approximately 1.3% and 5.0% of the essays were detected as false positives (human writing incorrectly labeled as AI) by AI detectors and human raters, respectively. Surveys generally indicated that students viewed the AI-generated work as better than their own (P < 0.01). Using AI detectors in aggregate reduced the likelihood of detecting a false positive to nearly 0%, and this strategy was validated against human rater-labeled false positives. 
Taken together, our findings show that AI detectors, when used together, become a powerful tool to inform instructors.

NEW & NOTEWORTHY: We show how online artificial intelligence (AI) detectors can assist instructors in distinguishing between human- and AI-written work for written assignments. Although individual AI detectors may vary in their accuracy for correctly identifying the origin of written work, they are most effective when used in aggregate to inform instructors when human intuition gets it wrong. Using AI detectors for consensus detection reduces the false positive rate to nearly zero.
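The "aggregate" strategy described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the study's actual pipeline: detector names, votes, and the unanimity rule are assumptions drawn from the abstract's description of consensus detection.

```python
# Hypothetical sketch of consensus ("aggregate") AI detection: an essay is
# flagged as AI-generated only when every detector independently agrees.
# Detector labels below are illustrative, not the study's actual tools or data.

def consensus_flag(detector_labels):
    """Return True only if all detectors label the essay 'ai'.

    Requiring unanimity makes a false positive (human writing mislabeled
    as AI) unlikely: every detector must err on the same essay at once.
    """
    return all(label == "ai" for label in detector_labels)

# Illustrative arithmetic: if three independent detectors each had a 1.3%
# false positive rate, unanimous agreement would give roughly 0.013 ** 3,
# i.e., a rate near zero -- consistent with the abstract's "nearly 0%".
essay_votes = {
    "human_essay": ["human", "ai", "human"],  # one detector errs; not flagged
    "ai_essay": ["ai", "ai", "ai"],           # unanimous; flagged
}

for essay, votes in essay_votes.items():
    verdict = "AI" if consensus_flag(votes) else "human"
    print(f"{essay}: {verdict}")
```

The design choice mirrors the paper's finding: a single detector's error rarely coincides with the others', so unanimity trades a small amount of sensitivity for a false positive rate near zero.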

Source journal metrics: CiteScore 3.40 · Self-citation rate 19.00% · Annual articles: 100 · Review time: >12 weeks
Journal description: Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.