'ChatGPT can make mistakes' warnings fail: A randomized controlled trial
Yavuz Selim Kıyak, Özlem Coşkun, Işıl İrem Budakoğlu
Medical Education, DOI: 10.1111/medu.70056, published 25 September 2025
Abstract
Background: Warnings are commonly used to signal the fallibility of AI systems like ChatGPT in clinical decision-making. Yet, little is known about whether such disclaimers influence medical students' diagnostic behaviour. Drawing on the Judge-Advisor System (JAS) theory, we investigated whether the warning alters advice-taking behaviour by modifying perceived advisor credibility.
Method: In this randomized controlled trial, 186 fourth-year medical students evaluated three clinical vignettes, each with two diagnostic options. Each case was deliberately designed to include presentations consistent with both diagnoses, making it ambiguous. Students were randomly assigned to receive feedback either with (warning arm) or without (no-warning arm) a prominently displayed warning ('ChatGPT can make mistakes. Check important info.'). After submitting their initial response, students received ChatGPT-attributed disagreeing diagnostic feedback explaining why the alternative diagnosis was correct. They were then given the opportunity to revise their original choice. Advice-taking was measured by whether students changed their diagnosis after viewing the AI input. We analysed change rates and weight of advice (WoA) and used mixed-effects models to assess intervention effects.
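The abstract does not define WoA. In the Judge-Advisor System literature it is conventionally computed as the judge's shift toward the advice relative to the distance between the initial judgement and the advice, typically truncated to the [0, 1] interval; for a binary diagnostic choice this reduces to 1 when the student adopts the AI's diagnosis and 0 when they keep their own. A sketch of the conventional formula (our summary of the standard JAS measure, not taken from the paper):

\[
\mathrm{WoA} \;=\; \frac{\text{final judgement} - \text{initial judgement}}{\text{advice} - \text{initial judgement}}, \qquad \mathrm{WoA} \in [0, 1].
\]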
Results: The warning did not influence diagnostic changes (15.3% no-warning vs. 15.9% warning; OR = 1.09, 95% CI: 0.46-2.59, p = 0.84). The mean WoA was 0.15 (SD = 0.36), significantly lower than the 0.30 average reported in a prior JAS meta-analysis (p < 0.001). Among students who retained their original diagnosis, the warning group showed a trend toward explaining why they disagreed with the AI advisor (60% vs. 51%, p = 0.059).
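As a rough consistency check (our own arithmetic, not reported in the abstract): if WoA is scored dichotomously per case, a mean of 0.15 implies a standard deviation of about

\[
\sqrt{0.15 \times (1 - 0.15)} \approx 0.36,
\]

which matches the reported SD and the roughly 15% change rates, consistent with advice-taking being an all-or-nothing decision in this binary-choice design.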
Conclusions: Students underweighted the AI's diagnostic advice. The disclaimer did not alter students' use of AI advice, suggesting that the perceived credibility of ChatGPT was already near a behavioural floor. This finding supports the existence of a credibility threshold beyond which additional cautionary cues have limited effect. Our results refine advice-taking theory and signal that simple warnings may be insufficient to ensure calibrated trust in AI-supported learning.
About the journal:
Medical Education seeks to be the pre-eminent journal in the field of education for health care professionals, and publishes material of the highest quality, reflecting worldwide or provocative issues and perspectives.
The journal welcomes high-quality papers on all aspects of health professional education, including:
- undergraduate education
- postgraduate training
- continuing professional development
- interprofessional education