Exploring Biases of Large Language Models in the Field of Mental Health: Comparative Questionnaire Study of the Effect of Gender and Sexual Orientation in Anorexia Nervosa and Bulimia Nervosa Case Vignettes.

IF 4.8 · Medicine (Region 2) · Q1 PSYCHIATRY
JMIR Mental Health · Pub Date: 2025-03-20 · DOI: 10.2196/57986
Rebekka Schnepper, Noa Roemmel, Rainer Schaefert, Lena Lambrecht-Walzinger, Gunther Meinlschmidt
Citations: 0

Abstract


Background: Large language models (LLMs) are increasingly used in mental health and show promise in assessing disorders. However, concerns remain regarding their accuracy, reliability, and fairness. Societal biases and the underrepresentation of certain populations may affect LLMs. Because LLMs are already used in clinical practice, including decision support, investigating potential biases is important to ensure responsible use. Anorexia nervosa (AN) and bulimia nervosa (BN) have a lifetime prevalence of 1%-2% and affect more women than men. Among men, homosexual men face a higher risk of eating disorders (EDs) than heterosexual men. However, men are underrepresented in ED research, and studies on gender, sexual orientation, and their impact on AN and BN prevalence, symptoms, and treatment outcomes remain limited.

Objectives: We aimed to estimate the presence and size of gender- and sexual orientation-related bias produced by a widely used LLM, as well as by a smaller LLM specifically trained for mental health analyses, exemplified in the context of ED symptomatology and health-related quality of life (HRQoL) of patients with AN or BN.

Methods: We extracted 30 case vignettes (22 AN and 8 BN) from scientific papers. We adapted each vignette to create 4 versions, describing a female versus male patient living with a female versus male partner (2 × 2 design), yielding 120 vignettes. We then fed each vignette 3 times into ChatGPT-4 and into "MentaLLaMA," which is based on the Large Language Model Meta AI (LLaMA) architecture, instructing each model to evaluate the vignette by responding to 2 psychometric instruments: the RAND-36 questionnaire, assessing HRQoL, and the Eating Disorder Examination Questionnaire. With the resulting LLM-generated scores, we calculated multilevel models with a random intercept for gender and sexual orientation (accounting for within-vignette variance), nested in vignettes (accounting for between-vignette variance).
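The analysis described above can be sketched as a mixed-effects model on the 360 LLM-generated scores (30 vignettes × 4 versions × 3 runs). The following is a minimal illustration using simulated data, not the authors' actual code or data; the variable names, the simulated gender effect, and the use of statsmodels' `mixedlm` are all assumptions made for demonstration:

```python
# Hypothetical sketch of the multilevel analysis: LLM-generated scores
# nested in vignettes, with fixed effects for gender, sexual orientation,
# and their interaction, plus a random intercept per vignette.
# All data below are simulated, including the assumed gender effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for vignette in range(30):              # 30 base case vignettes
    base = rng.normal(14, 2)            # vignette-level intercept (between-vignette variance)
    for gender in ("female", "male"):
        for orientation in ("heterosexual", "homosexual"):
            for rep in range(3):        # each version scored 3 times
                score = base - (1.5 if gender == "male" else 0.0) + rng.normal(0, 1)
                rows.append({"vignette": vignette, "gender": gender,
                             "orientation": orientation, "score": score})
df = pd.DataFrame(rows)                 # 30 x 2 x 2 x 3 = 360 observations

# Random intercept for vignette; fixed effects for gender x orientation.
model = smf.mixedlm("score ~ gender * orientation", df, groups=df["vignette"])
result = model.fit()
print(result.summary())
```

In this sketch, a negative coefficient for the male level of `gender` would mirror the kind of gender-associated difference the study reports for the RAND-36 mental composite summary; the residual variance captures within-vignette (run-to-run) variability.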

Results: In ChatGPT-4, the multilevel model with 360 observations indicated a significant association of gender with the RAND-36 mental composite summary (conditional means: 12.8 for male and 15.1 for female cases; 95% CI of the effect -6.15 to -0.35; P=.04) but no association with sexual orientation (P=.71) and no interaction effect (P=.37). For the Eating Disorder Examination Questionnaire overall score (conditional means: 5.59-5.65; 95% CIs 5.45-5.70), we found no indication of main effects of gender (conditional means: 5.65 for male and 5.61 for female cases; 95% CI -0.10 to 0.14; P=.88) or sexual orientation (conditional means: 5.63 for heterosexual and 5.62 for homosexual cases; 95% CI -0.14 to 0.09; P=.67), nor of an interaction effect (95% CI -0.11 to 0.19; P=.61). MentaLLaMA did not yield reliable results.

Conclusions: LLM-generated mental HRQoL estimates for AN and BN case vignettes may be biased by gender, with male cases scoring lower despite no real-world evidence supporting this pattern. This highlights the risk of bias in generative artificial intelligence in the field of mental health. Understanding and mitigating biases related to gender and other factors, such as ethnicity and socioeconomic status, is crucial for the responsible use of LLMs in diagnostics and treatment recommendations.

Source journal
JMIR Mental Health (Medicine - Psychiatry and Mental Health)
CiteScore: 10.80
Self-citation rate: 3.80%
Articles per year: 104
Review time: 16 weeks
Journal description: JMIR Mental Health (JMH, ISSN 2368-7959) is a PubMed-indexed, peer-reviewed sister journal of JMIR, the leading eHealth journal (2016 impact factor: 5.175). JMIR Mental Health focuses on digital health and internet interventions, technologies, and electronic innovations (software and hardware) for mental health, addictions, online counselling, and behaviour change. This includes formative evaluations and system descriptions, theoretical papers, review papers, viewpoint/vision papers, and rigorous evaluations.