Dorit Hadar Shoval, Karny Gigi, Yuval Haber, Amir Itzhaki, Kfir Asraf, David Piterman, Zohar Elyoseph
{"title":"一项使用Asch范式检验精神病学评估中大语言模型一致性的对照试验。","authors":"Dorit Hadar Shoval, Karny Gigi, Yuval Haber, Amir Itzhaki, Kfir Asraf, David Piterman, Zohar Elyoseph","doi":"10.1186/s12888-025-06912-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.</p><p><strong>Methods: </strong>Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions.</p><p><strong>Results: </strong>Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P <.05), with the most severe effects observed in psychiatric assessment (χ²₁=16.20, P <.001).</p><p><strong>Conclusions: </strong>This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. 
Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.</p><p><strong>Trial registration: </strong>Not applicable.</p>","PeriodicalId":9029,"journal":{"name":"BMC Psychiatry","volume":"25 1","pages":"478"},"PeriodicalIF":3.4000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070653/pdf/","citationCount":"0","resultStr":"{\"title\":\"A controlled trial examining large Language model conformity in psychiatric assessment using the Asch paradigm.\",\"authors\":\"Dorit Hadar Shoval, Karny Gigi, Yuval Haber, Amir Itzhaki, Kfir Asraf, David Piterman, Zohar Elyoseph\",\"doi\":\"10.1186/s12888-025-06912-2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.</p><p><strong>Methods: </strong>Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions.</p><p><strong>Results: </strong>Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P <.05), with the most severe effects observed in psychiatric assessment (χ²₁=16.20, P <.001).</p><p><strong>Conclusions: </strong>This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. 
Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.</p><p><strong>Trial registration: </strong>Not applicable.</p>\",\"PeriodicalId\":9029,\"journal\":{\"name\":\"BMC Psychiatry\",\"volume\":\"25 1\",\"pages\":\"478\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12070653/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Psychiatry\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12888-025-06912-2\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PSYCHIATRY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Psychiatry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12888-025-06912-2","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PSYCHIATRY","Score":null,"Total":0}
A controlled trial examining large language model conformity in psychiatric assessment using the Asch paradigm.
Background: Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.
Methods: Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o's performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children's drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. The binomial test and chi-square analyses assessed performance differences across conditions.
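The abstract does not include the exact prompts or stimuli, so the snippet below is only a minimal sketch of how an Asch-style pressure trial could be run against GPT-4o. It assumes the OpenAI chat completions API; the function name, prompt wording, answer options, and peer-response phrasing are invented placeholders, not the study's actual materials.

```python
# Hypothetical sketch of an Asch-style pressure trial for one multiple-choice item.
# Assumes the OpenAI Python client; prompt wording and options are placeholders.
from openai import OpenAI

client = OpenAI()

def run_trial(question: str, options: list[str], peer_answers: list[str]) -> str:
    """Ask GPT-4o one multiple-choice item, optionally preceded by peer responses.

    peer_answers models the pressure condition: an empty list is "no pressure",
    five identical wrong answers is "full pressure", and a mix of correct and
    incorrect answers is "partial pressure".
    """
    peer_block = "\n".join(
        f"Rater {i + 1} answered: {ans}" for i, ans in enumerate(peer_answers)
    )
    prompt = (
        f"{question}\n"
        f"Options: {', '.join(options)}\n"
        f"{peer_block}\n"
        "Reply with one option only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Full-pressure condition: five consecutive incorrect peer responses.
answer = run_trial(
    question="Which reference circle is most similar to the target circle?",
    options=["A", "B", "C"],
    peer_answers=["C"] * 5,  # all simulated peers endorse the incorrect option
)
print(answer)
```

Under this setup, each of the nine condition combinations (3 domains × 3 pressure conditions) would be run for 10 independent trials, giving the 90 observations described above.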
Results: Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between the no-pressure and pressure conditions were statistically significant (P < .05), with the most severe effects observed in psychiatric assessment (χ²₁ = 16.20, P < .001).
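The reported psychiatric-assessment effect can be reproduced from the cell counts implied by the accuracies above. The sketch below assumes a 2 × 2 table of correct/incorrect trials (10/10 correct under no pressure vs. 0/10 under full pressure) and Yates' continuity correction, which is SciPy's default for 2 × 2 tables; under those assumptions it yields χ²₁ = 16.20, P < .001, matching the reported value. The abstract does not specify the analysis beyond "chi-square analyses", so this is one plausible reconstruction rather than the authors' exact computation.

```python
# Reconstructing the psychiatric-assessment chi-square from the reported accuracies:
# 10/10 correct under no pressure vs. 0/10 correct under full pressure.
# Assumes a 2x2 contingency table with Yates' continuity correction
# (the scipy default for 2x2 tables).
from scipy.stats import chi2_contingency

table = [
    [10, 0],   # no pressure: correct, incorrect
    [0, 10],   # full pressure: correct, incorrect
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.5f}")  # chi2(1) = 16.20, p < .001
```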
Conclusions: This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.
About the journal:
BMC Psychiatry is an open access, peer-reviewed journal that considers articles on all aspects of the prevention, diagnosis and management of psychiatric disorders, as well as related molecular genetics, pathophysiology, and epidemiology.