Estimating depression severity in narrative clinical notes using large language models

IF 4.9 Zone 2 Medicine Q1 CLINICAL NEUROLOGY

Journal of affective disorders Pub Date : 2025-04-03 DOI:10.1016/j.jad.2025.04.014

Thomas H. McCoy, Victor M. Castro, Roy H. Perlis

Volume 381, Pages 270-274. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S016503272500566X

Citations: 0

Abstract


Estimating depression severity in narrative clinical notes using large language models

Background

Depression treatment guidelines emphasize measurement-based care using patient-reported outcome measures, yet their impact on narrative documentation quality remains underexplored.

Methods

We sampled 15,000 narrative clinical outpatient notes from the electronic health record of a large academic medical center, reflecting visits between January 2, 2019 and January 30, 2024, for which a 9-item Patient Health Questionnaire (PHQ-9) was completed at the same time. After censoring PHQ-9 scores from notes, we estimated severity of depressive symptoms with a foundational large language model (gpt4o-08-06) in a HIPAA-compliant enclave. We estimated correlation between true PHQ-9 and model-estimated score and examined the predictive performance of the model for moderate or greater depressive symptoms.
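The two evaluation quantities named in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the cutoff of 10 for "moderate or greater" symptoms is the conventional PHQ-9 threshold and is an assumption here (the abstract does not state it), the function names are hypothetical, and the scores are made up.

```python
from math import sqrt

# Assumption: a PHQ-9 total of 10 or more marks "moderate or greater" symptoms
# (the conventional cutoff; the abstract does not state the threshold used).
MODERATE_CUTOFF = 10


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ppv(true_scores, predicted_scores, cutoff=MODERATE_CUTOFF):
    """Share of notes flagged by the model that truly meet the cutoff."""
    flagged = [t for t, p in zip(true_scores, predicted_scores) if p >= cutoff]
    if not flagged:
        return float("nan")  # undefined when the model flags nothing
    return sum(t >= cutoff for t in flagged) / len(flagged)


# Made-up scores for illustration only; these are not study data.
true_phq9 = [0, 2, 12, 0, 15, 1, 0, 11]
model_phq9 = [1, 10, 9, 0, 14, 3, 12, 13]

r = pearson_r(true_phq9, model_phq9)
print(f"r^2 = {r * r:.3f}")
print(f"PPV = {ppv(true_phq9, model_phq9):.3f}")
```

With a low base rate of true positives, as in a sample where most measured scores are near zero, PPV can stay low even when the model ranks notes reasonably well, which is why the two numbers are worth reading together.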

Results

Mean age was 46.3 years (SD 14.9); 9083 (60.6 %) identified as female. 925 (6.2 %) identified as Asian, 638 (4.3 %) as Black, 853 (5.7 %) as another race, and 12,187 (81.2 %) as White. A total of 1044 (7.0 %) identified as Hispanic ethnicity, while 12,699 (84.7 %) were non-Hispanic. Mean measured PHQ-9 score was 1.23 (SD 3.45); 721 (4.8 %) met criteria for moderate or greater depressive symptoms. LLM-predicted PHQ-9 scores were modestly correlated with actual scores (r² = 0.264 (95 % CI 0.252–0.276)); PPV for moderate or greater depression was 0.309 (95 % CI 0.302–0.317). Performance was consistent across demographic subgroups, with modest differences identified by race, ethnicity, and sex.
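As a quick consistency check, the reported counts can be compared against their percentages. The sketch below assumes every percentage uses the 15,000 sampled notes as its denominator, which the abstract implies but does not state; the helper name `mismatches` is hypothetical.

```python
# Assumption: every percentage uses the 15,000 sampled notes as its denominator;
# the abstract does not say so explicitly.
TOTAL = 15_000

# (label, count, percentage as reported in the abstract)
reported = [
    ("female", 9083, 60.6),
    ("Asian", 925, 6.2),
    ("Black", 638, 4.3),
    ("another race", 853, 5.7),
    ("White", 12187, 81.2),
    ("Hispanic", 1044, 7.0),
    ("non-Hispanic", 12699, 84.7),
    ("moderate or greater symptoms", 721, 4.8),
]


def mismatches(rows, total=TOTAL, tolerance=0.05):
    """Return labels whose reported percentage differs from count/total
    by more than half of the last reported decimal place."""
    return [
        label
        for label, count, pct in rows
        if abs(100 * count / total - pct) > tolerance + 1e-9
    ]


print(mismatches(reported))  # prints []
```

Under that assumption all eight figures agree with their counts to one decimal place.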

Conclusion

A foundational LLM performed poorly but consistently across subgroups in imputing PHQ-9 scores from notes when actual PHQ-9 reporting was ablated. This result suggests the extent to which inclusion of PROMs may impoverish documentation of psychiatric symptoms.


Source journal

Journal of affective disorders (Medicine - Psychiatry)

CiteScore

10.90

Self-citation rate

6.10%

Articles published

1319

Review time

9.3 weeks

About the journal: The Journal of Affective Disorders publishes papers concerned with affective disorders in the widest sense: depression, mania, mood spectrum, emotions and personality, anxiety and stress. It is interdisciplinary and aims to bring together different approaches for a diverse readership. Top quality papers will be accepted dealing with any aspect of affective disorders, including neuroimaging, cognitive neurosciences, genetics, molecular biology, experimental and clinical neurosciences, pharmacology, neuroimmunoendocrinology, intervention and treatment trials.