Estimating depression severity in narrative clinical notes using large language models

IF 4.9 | Zone 2 (Medicine) | JCR Q1 (Clinical Neurology)
Thomas H. McCoy, Victor M. Castro, Roy H. Perlis
Journal of Affective Disorders, volume 381, pages 270-274. Published 2025-04-03. DOI: 10.1016/j.jad.2025.04.014
https://www.sciencedirect.com/science/article/pii/S016503272500566X
Citations: 0

Abstract

Estimating depression severity in narrative clinical notes using large language models

Background

Depression treatment guidelines emphasize measurement-based care using patient-reported outcome measures, yet their impact on narrative documentation quality remains underexplored.

Methods

We sampled 15,000 narrative clinical outpatient notes from the electronic health record of a large academic medical center, reflecting visits between January 2, 2019 and January 30, 2024, for which a 9-item Patient Health Questionnaire (PHQ-9) was completed at the same time. After censoring PHQ-9 scores from notes, we estimated severity of depressive symptoms with a foundational large language model (gpt4o-08-06) in a HIPAA-compliant enclave. We estimated correlation between true PHQ-9 and model-estimated score and examined the predictive performance of the model for moderate or greater depressive symptoms.
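The two evaluation quantities named above can be illustrated with a minimal sketch. It assumes the conventional PHQ-9 cutoff of 10 for moderate or greater symptoms (the abstract does not state the threshold used), and the scores below are invented toy values, not study data. The helper names `pearson_r` and `ppv` are hypothetical, not taken from the authors' code.

```python
from math import sqrt

# Assumption: conventional PHQ-9 cutoff for "moderate or greater" symptoms.
MODERATE_CUTOFF = 10


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ppv(true_scores, predicted_scores, cutoff=MODERATE_CUTOFF):
    """Share of notes flagged by the model that truly meet the cutoff."""
    flagged = [t for t, p in zip(true_scores, predicted_scores) if p >= cutoff]
    if not flagged:
        return float("nan")
    return sum(t >= cutoff for t in flagged) / len(flagged)


# Toy data, invented for illustration only.
true_phq9 = [0, 2, 11, 4, 15, 0, 9, 12]
llm_phq9 = [1, 3, 8, 10, 14, 0, 12, 11]

r = pearson_r(true_phq9, llm_phq9)
print(f"r^2 = {r ** 2:.3f}")
print(f"PPV = {ppv(true_phq9, llm_phq9):.3f}")
```

With a low base rate of moderate or greater symptoms, PPV can be modest even when the model ranks notes reasonably well, so the two figures are best read together.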

Results

Mean age was 46.3 years (SD 14.9); 9083 (60.6%) identified as female. 925 (6.2%) identified as Asian, 638 (4.3%) as Black, 853 (5.7%) as another race, and 12,187 (81.2%) as White. A total of 1044 (7.0%) identified as Hispanic ethnicity, while 12,699 (84.7%) were non-Hispanic. Mean measured PHQ-9 score was 1.23 (SD 3.45); 721 (4.8%) met criteria for moderate or greater depressive symptoms. LLM-predicted PHQ-9 scores were modestly correlated with actual scores (r² = 0.264; 95% CI 0.252–0.276); PPV for moderate or greater depression was 0.309 (95% CI 0.302–0.317). Performance was consistent across demographic subgroups, with modest differences identified by race, ethnicity, and sex.
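As a quick arithmetic check on the figures above, the following sketch recomputes each reported percentage from its count, assuming the denominator is the 15,000 sampled notes stated in Methods. It also converts the reported r² to the implied correlation magnitude; taking square roots of the interval bounds is only a rough mapping, since the abstract does not say how the interval was derived.

```python
from math import sqrt

N_NOTES = 15_000  # sample size stated in Methods (assumed denominator)

# (count, reported percentage) pairs copied from the Results paragraph.
reported = {
    "female": (9083, 60.6),
    "Asian": (925, 6.2),
    "Black": (638, 4.3),
    "another race": (853, 5.7),
    "White": (12187, 81.2),
    "Hispanic": (1044, 7.0),
    "non-Hispanic": (12699, 84.7),
    "moderate or greater symptoms": (721, 4.8),
}

mismatches = []
for label, (count, pct) in reported.items():
    computed = round(100 * count / N_NOTES, 1)
    if computed != pct:
        mismatches.append((label, computed, pct))
    print(f"{label}: {computed}% (reported {pct}%)")

# Correlation magnitude implied by the reported r^2 and its interval bounds.
r_point, r_low, r_high = (round(sqrt(v), 3) for v in (0.264, 0.252, 0.276))
print(f"|r| = {r_point} (from r^2 bounds: {r_low} to {r_high})")
```

All eight percentages reproduce from a denominator of 15,000, and an r² of 0.264 corresponds to a correlation of roughly 0.51.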

Conclusion

A foundational LLM performed poorly but consistently across subgroups in imputing PHQ-9 scores from notes when actual PHQ-9 reporting was ablated. This result suggests the extent to which inclusion of PROMs may impoverish documentation of psychiatric symptoms.
Source journal
Journal of Affective Disorders (Medicine: Psychiatry)
CiteScore: 10.90
Self-citation rate: 6.10%
Articles published: 1319
Review time: 9.3 weeks
About the journal: The Journal of Affective Disorders publishes papers concerned with affective disorders in the widest sense: depression, mania, mood spectrum, emotions and personality, anxiety and stress. It is interdisciplinary and aims to bring together different approaches for a diverse readership. Top quality papers will be accepted dealing with any aspect of affective disorders, including neuroimaging, cognitive neurosciences, genetics, molecular biology, experimental and clinical neurosciences, pharmacology, neuroimmunoendocrinology, intervention and treatment trials.