Comparing Generative Artificial Intelligence and Mental Health Professionals for Clinical Decision-Making With Trauma-Exposed Populations: Vignette-Based Experimental Study.
Katherine E Wislocki, Sabahat Sami, Gahl Liberzon, Alyson K Zalta
JMIR Mental Health, volume 12, e80801. Published 2025-10-14. Publication type: Journal Article. DOI: 10.2196/80801 (https://doi.org/10.2196/80801). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12527320/pdf/
Citation count: 0
Abstract
Background: Trauma exposure is highly prevalent and associated with various health issues. However, health care professionals can exhibit trauma-related diagnostic overshadowing bias, leading to misdiagnosis and inadequate treatment of trauma-exposed populations. Generative artificial intelligence (GAI) models are increasingly used in health care contexts. No research has examined whether GAI demonstrates this bias in decision-making or how its rates of bias compare with those of mental health professionals (MHPs).
Objective: This study aimed to assess trauma-related diagnostic overshadowing among frontier GAI models and compare evidence of trauma-related diagnostic overshadowing between frontier GAI models and MHPs.
Methods: MHPs (N=232; mean [SD] age 43.7 [15.95] years) completed an experimental paradigm consisting of 2 vignettes describing adults presenting with obsessive-compulsive symptoms or substance abuse symptoms. One vignette included a trauma exposure history (ie, sexual trauma or physical trauma), and one vignette did not include a trauma exposure history. Participants answered questions about their preferences for diagnosis and treatment options for clients within the vignettes. GAI models (eg, Gemini 1.5 Flash, ChatGPT-4o mini, Claude Sonnet, and Meta Llama 3) completed the same experimental paradigm, with each block being reviewed by each GAI model 20 times. Mann-Whitney U tests and chi-square analyses were used to assess diagnostic and treatment decision-making across vignette factors and respondents.
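To make the Likert-rating comparison concrete, the following is a minimal sketch of a Mann-Whitney U statistic together with a rank-biserial correlation, a common effect size for this test (the rb values reported in this abstract are presumably of this kind). The function names and the ratings in the usage note are hypothetical and are not the study's data or code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """U for group_a: count of (a, b) pairs with a > b, ties counted as 0.5."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    rank_sum_a = sum(ranks[:n_a])
    return rank_sum_a - n_a * (n_a + 1) / 2


def rank_biserial(group_a, group_b):
    """Rank-biserial correlation, 2U / (n_a * n_b) - 1, ranging from -1 to 1."""
    u_a = mann_whitney_u(group_a, group_b)
    return 2 * u_a / (len(group_a) * len(group_b)) - 1


if __name__ == "__main__":
    # Hypothetical Likert ratings of a target diagnosis (illustrative only).
    gai_ratings = [6, 7, 7, 5, 6]
    mhp_ratings = [4, 5, 3, 6, 4]
    print("U =", mann_whitney_u(gai_ratings, mhp_ratings))
    print("rank-biserial r =", round(rank_biserial(gai_ratings, mhp_ratings), 2))
```

A positive value means the first group tends to give higher ratings than the second. This sketch omits the P value, which in practice would come from a statistics package.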
Results: GAI models, similar to MHPs, demonstrated some evidence of trauma-related diagnostic overshadowing bias, particularly in Likert-based ratings of posttraumatic stress disorder diagnosis and treatment when sexual trauma was present (P<.001). However, GAI models generally exhibited significantly less bias than MHPs across both Likert and forced-choice clinical decision tasks. Compared to MHPs, GAI models assigned higher ratings for the target diagnosis and treatment in obsessive-compulsive disorder vignettes (rb=0.43 to 0.63; P<.001) and for the target treatment in substance use disorder vignettes (rb=0.57; P<.001) when trauma was present. In forced-choice tasks, GAI models were significantly more accurate than MHPs in selecting the correct diagnosis and treatment for obsessive-compulsive disorder vignettes (χ²₁=48.84 to 61.07; P<.001) and for substance use disorder vignettes involving sexual trauma (χ²₁=15.17 to 101.61; P<.001).
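The forced-choice comparisons above are chi-square tests with 1 degree of freedom, which corresponds to a 2x2 table (respondent type by correct or incorrect choice). The sketch below shows how such a statistic can be computed, assuming a Pearson chi-square without continuity correction; whether the study applied a correction is not stated here. The function name and the counts are hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    table is [[a, b], [c, d]] of observed counts.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, chi2 is a squared standard normal variable,
    # so the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts (illustrative only):
    # rows = respondent type, columns = (correct, incorrect).
    counts = [[30, 10], [20, 20]]
    chi2, p = chi_square_2x2(counts)
    print(f"chi2(1) = {chi2:.2f}, p = {p:.4f}")
```

The closed-form P value holds only for 1 degree of freedom; larger tables would need a general chi-square distribution function from a statistics library.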
Conclusions: GAI models demonstrated some evidence of trauma-related diagnostic overshadowing bias, yet the degree of bias varied by task and model. Moreover, GAI models generally demonstrated less bias than MHPs in this experimental paradigm. These findings highlight the importance of understanding GAI biases in mental health care. More research into bias reduction strategies and responsible implementation of GAI models in mental health care is needed.
About the journal:
JMIR Mental Health (JMH, ISSN 2368-7959) is a PubMed-indexed, peer-reviewed sister journal of JMIR, the leading eHealth journal (Impact Factor 2016: 5.175).
JMIR Mental Health focusses on digital health and Internet interventions, technologies and electronic innovations (software and hardware) for mental health, addictions, online counselling and behaviour change. This includes formative evaluation and system descriptions, theoretical papers, review papers, viewpoint/vision papers, and rigorous evaluations.