Using aggregated AI detector outcomes to eliminate false positives in STEM-student writing
Jon-Philippe K. Hyatt, Elisa Jayne Bienenstock, Carla M. Firetto, Elizabeth R. Woods, Robert C. Comus
Advances in Physiology Education, June 2025, pp. 486-495 (Epub March 19, 2025). DOI: 10.1152/advan.00235.2024
Citations: 0
Abstract
Generative artificial intelligence (AI) large language models have become sufficiently accessible and user-friendly to assist students with course work, studying tactics, and written communication. AI-generated writing is almost indistinguishable from human-derived work. Instructors must rely on intuition/experience and, recently, assistance from online AI detectors to help them distinguish between student- and AI-written material. Here, we tested the veracity of AI detectors for writing samples from a fact-heavy, lower-division undergraduate anatomy and physiology course. Student participants (n = 190) completed three parts: a hand-written essay answering a prompt on the structure/function of the plasma membrane; creating an AI-generated answer to the same prompt; and a survey seeking participants' views on the quality of each essay as well as general AI use. Randomly selected (n = 50) participant-written and AI-generated essays were blindly uploaded onto four AI detectors; a separate and unique group of randomly selected essays (n = 48) was provided to human raters (n = 9) for classification assessment. For the majority of essays, human raters and the best-performing AI detectors (n = 3) similarly identified their correct origin (84-95% and 93-98%, respectively) (P > 0.05). Approximately 1.3% and 5.0% of the essays were detected as false positives (human writing incorrectly labeled as AI) by AI detectors and human raters, respectively. Surveys generally indicated that students viewed the AI-generated work as better than their own (P < 0.01). Using AI detectors in aggregate reduced the likelihood of detecting a false positive to nearly 0%, and this strategy was validated against human rater-labeled false positives. 
Taken together, our findings show that AI detectors, when used together, become a powerful tool to inform instructors.

NEW & NOTEWORTHY

We show how online artificial intelligence (AI) detectors can assist instructors in distinguishing between human- and AI-written work for written assignments. Although individual AI detectors may vary in their accuracy for correctly identifying the origin of written work, they are most effective when used in aggregate to inform instructors when human intuition gets it wrong. Using AI detectors for consensus detection reduces the false positive rate to nearly zero.
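The consensus strategy described in the abstract, flagging an essay as AI-generated only when multiple detectors agree, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name and the example verdicts are hypothetical, and it assumes each detector returns a binary AI/human label. Requiring unanimity drives the aggregate false positive rate toward zero (if each detector's false positives were independent with rate p, unanimous agreement would occur at roughly p^n), at the cost of missing some AI-written essays.

```python
def consensus_flag(verdicts):
    """Flag an essay as AI-generated only when every detector agrees.

    verdicts: list of booleans, one per detector; True means that
    detector labeled the essay as AI-written. An empty list yields
    False (no evidence, no flag).
    """
    return bool(verdicts) and all(verdicts)


# Hypothetical outputs from four detectors for one student-written essay:
# three say "human", one mistakenly says "AI". No consensus, so the
# essay is not flagged as a false positive.
detector_verdicts = [False, False, True, False]
print(consensus_flag(detector_verdicts))  # → False
```

The design choice mirrors the paper's rationale: a single detector's error should never be enough to accuse a student, so disagreement among detectors defaults to "human".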
Journal Introduction
Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.