Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study.

IF 4.8 | Zone 2 (Medicine) | JCR Q1 Psychiatry

JMIR Mental Health | Pub Date: 2024-08-02 | DOI: 10.2196/58129

Christine Lee, Matthew Mohebbi, Erin O'Callaghan, Mirène Winsberg

Citations: 0

Abstract

Background: Due to recent advances in artificial intelligence, large language models (LLMs) have emerged as a powerful tool for a variety of language-related tasks, including sentiment analysis and summarization of provider-patient interactions. However, there is limited research on these models in the area of crisis prediction.

Objective: This study aimed to evaluate the performance of LLMs, specifically OpenAI's generative pretrained transformer 4 (GPT-4), in predicting current and future mental health crisis episodes using patient-provided information at intake among users of a national telemental health platform.

Methods: Deidentified patient-provided data were pulled from specific intake questions of the Brightside telehealth platform, including the chief complaint, for 140 patients who indicated suicidal ideation (SI) with a plan at intake, and another 120 patients who later indicated SI with a plan during the course of treatment. Similar data were pulled for 200 randomly selected patients, treated during the same time period, who never endorsed SI. In total, 6 senior Brightside clinicians (3 psychologists and 3 psychiatrists) were shown patients' self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and other reported symptoms, including SI. For each patient, they were asked a simple yes or no question on whether they predicted endorsement of SI with plan, along with their confidence level in that prediction. GPT-4 was provided with similar information and asked to answer the same questions, enabling us to directly compare the performance of artificial intelligence and clinicians.

Results: Overall, the clinicians' average precision (0.7) was higher than that of GPT-4 (0.6) in identifying SI with plan at intake (n=140) versus no SI (n=200) when using the chief complaint alone, while sensitivity was higher for GPT-4 (0.62) than the clinicians' average (0.53). The addition of suicide attempt history increased the clinicians' average sensitivity (0.59) and precision (0.77) while increasing GPT-4's sensitivity (0.59) but decreasing GPT-4's precision (0.54). Performance was lower for both groups when predicting future SI with plan (n=120) versus no SI (n=200) from the chief complaint alone: clinicians (average sensitivity=0.4; average precision=0.59) and GPT-4 (sensitivity=0.46; precision=0.48). The addition of suicide attempt history improved performance on this task for the clinicians (average sensitivity=0.46; average precision=0.69) and GPT-4 (sensitivity=0.74; precision=0.48).
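
As a reading aid for the two metrics reported above, the following minimal Python sketch shows how sensitivity and precision are defined from confusion-matrix counts, and how approximate counts can be back-calculated from the rounded values in the abstract. The names `Confusion` and `implied_counts` are hypothetical helpers introduced here for illustration. The abstract does not report raw counts, so any counts this produces are rough estimates derived from rounded figures, not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a yes/no prediction task."""
    tp: int  # predicted SI with plan, patient endorsed it
    fp: int  # predicted SI with plan, patient never endorsed SI
    fn: int  # predicted no, patient endorsed SI with plan
    tn: int  # predicted no, patient never endorsed SI


def sensitivity(c: Confusion) -> float:
    """Share of true cases that were flagged: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def precision(c: Confusion) -> float:
    """Share of flagged cases that were true cases: TP / (TP + FP)."""
    return c.tp / (c.tp + c.fp)


def implied_counts(n_pos: int, n_neg: int, sens: float, prec: float) -> Confusion:
    """Back-calculate approximate counts from rounded metrics.

    Assumption: a single rater judged all n_pos + n_neg patients. This is
    only an estimate, because the inputs are rounded to 1-2 decimals.
    """
    tp = round(sens * n_pos)
    fn = n_pos - tp
    # precision = tp / (tp + fp)  =>  fp = tp * (1 - precision) / precision
    fp = round(tp * (1 - prec) / prec)
    tn = n_neg - fp
    return Confusion(tp=tp, fp=fp, fn=fn, tn=tn)


if __name__ == "__main__":
    # GPT-4, chief complaint only, SI with plan at intake (n=140) vs no SI (n=200)
    est = implied_counts(n_pos=140, n_neg=200, sens=0.62, prec=0.60)
    print(est)
    print(f"sensitivity={sensitivity(est):.2f}, precision={precision(est):.2f}")
```

The clinician figures in the abstract are averages over 6 raters, so applying `implied_counts` to them would describe a notional "average clinician" rather than any individual's predictions.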

Conclusions: GPT-4, with a simple prompt design, produced results on some metrics that approached those of a trained clinician. Additional work must be done before such a model can be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the underlying data on which they are trained. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.

Source journal: JMIR Mental Health (Medicine: Psychiatry and Mental Health)

CiteScore: 10.80

Self-citation rate: 3.80%

Articles published: 104

Review time: 16 weeks

About the journal: JMIR Mental Health (JMH, ISSN 2368-7959) is a PubMed-indexed, peer-reviewed sister journal of JMIR, the leading eHealth journal (Impact Factor 2016: 5.175). JMIR Mental Health focusses on digital health and Internet interventions, technologies and electronic innovations (software and hardware) for mental health, addictions, online counselling and behaviour change. This includes formative evaluation and system descriptions, theoretical papers, review papers, viewpoint/vision papers, and rigorous evaluations.