A controlled trial examining large Language model conformity in psychiatric assessment using the Asch paradigm.

IF 3.4 · Zone 2, Medicine · JCR Q2, Psychiatry
Dorit Hadar Shoval, Karny Gigi, Yuval Haber, Amir Itzhaki, Kfir Asraf, David Piterman, Zohar Elyoseph
BMC Psychiatry, 25(1), 478. Published 2025-05-12. Journal Article.
DOI: 10.1186/s12888-025-06912-2 (https://doi.org/10.1186/s12888-025-06912-2)
PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070653/pdf/

Abstract

Background: Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.

Methods: Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions.
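The design described above can be laid out as a trial schedule. The following is a minimal sketch, not the authors' code: the function names, the answer labels, and in particular the composition of the partial-pressure peer sequence are assumptions, since the abstract only says it mixes correct and incorrect peer responses.

```python
from itertools import product

# Domains and pressure conditions as named in the abstract.
DOMAINS = ["circle similarity", "brain tumor identification", "psychiatric assessment"]
PRESSURES = ["none", "full", "partial"]
TRIALS_PER_CELL = 10


def peer_responses(pressure, correct, wrong):
    """Return the peer answers shown to the model before it responds.

    Hypothetical helper: only the full-pressure layout (five consecutive
    incorrect peers) is stated in the abstract.
    """
    if pressure == "none":
        return []
    if pressure == "full":
        return [wrong] * 5
    if pressure == "partial":
        # ASSUMPTION: the exact mix and order are not given in the abstract.
        return [wrong, correct, wrong, correct, wrong]
    raise ValueError(f"unknown pressure condition: {pressure}")


def build_schedule():
    """Enumerate every trial in the 3 x 3 design (10 trials per cell)."""
    return [
        {"domain": domain, "pressure": pressure, "trial": trial}
        for domain, pressure in product(DOMAINS, PRESSURES)
        for trial in range(1, TRIALS_PER_CELL + 1)
    ]


schedule = build_schedule()
print(len(schedule))  # 90 observations in total
```

Enumerating the cells this way makes the count explicit: 3 domains × 3 pressure conditions × 10 trials = 90 observations.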

Results: Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P < .05), with the most severe effects observed in psychiatric assessment (χ²₁ = 16.20, P < .001).
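As a worked check of the psychiatric-assessment statistic, the sketch below computes a chi-square for the 2 × 2 table implied by the abstract (10 of 10 correct under no pressure, 0 of 10 correct under pressure). Applying Yates's continuity correction is an assumption on our part: the abstract does not say which variant was used, but the corrected value agrees with the figure reported above. The function names are illustrative.

```python
import math


def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    return n * corrected ** 2 / (row1 * row2 * col1 * col2)


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Rows: no pressure, pressure. Columns: correct, incorrect.
chi2 = yates_chi_square(10, 0, 0, 10)
print(round(chi2, 2))             # 16.2
print(p_value_df1(chi2) < 0.001)  # True
```

Without the correction the same table would give χ² = 20, so the reported 16.20 is consistent with, though not proof of, a continuity-corrected test.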

Conclusions: This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.

Trial registration: Not applicable.

Source journal: BMC Psychiatry (Medicine, Psychiatry)
CiteScore: 5.90
Self-citation rate: 4.50%
Articles published: 716
Review time: 3-6 weeks
About the journal: BMC Psychiatry is an open access, peer-reviewed journal that considers articles on all aspects of the prevention, diagnosis and management of psychiatric disorders, as well as related molecular genetics, pathophysiology, and epidemiology.