Estimating depression severity in narrative clinical notes using large language models
Thomas H. McCoy, Victor M. Castro, Roy H. Perlis
Journal of Affective Disorders, volume 381, pages 270-274. Published 2025-04-03.
DOI: 10.1016/j.jad.2025.04.014
https://www.sciencedirect.com/science/article/pii/S016503272500566X
Background
Depression treatment guidelines emphasize measurement-based care using patient-reported outcome measures, yet their impact on narrative documentation quality remains underexplored.
Methods
We sampled 15,000 narrative clinical outpatient notes from the electronic health record of a large academic medical center, reflecting visits between January 2, 2019 and January 30, 2024, for which a 9-item Patient Health Questionnaire (PHQ-9) was completed at the same time. After censoring PHQ-9 scores from notes, we estimated severity of depressive symptoms with a foundational large language model (gpt4o-08-06) in a HIPAA-compliant enclave. We estimated correlation between true PHQ-9 and model-estimated score and examined the predictive performance of the model for moderate or greater depressive symptoms.
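The two evaluation quantities named in the Methods, the correlation between true and model-estimated PHQ-9 scores and the predictive performance for moderate or greater symptoms, can be illustrated with a minimal Python sketch. This is not the authors' code. The cutoff of 10 for "moderate or greater" is the conventional PHQ-9 threshold and is assumed here, since the abstract does not state it. The function names and the toy scores are hypothetical.

```python
from math import sqrt

# Assumption: conventional PHQ-9 cutoff for moderate symptoms (not stated in the abstract).
MODERATE_CUTOFF = 10


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ppv(true_scores, predicted_scores, cutoff=MODERATE_CUTOFF):
    """Share of notes flagged by the model (predicted >= cutoff) whose true score is also >= cutoff."""
    flagged = [t for t, p in zip(true_scores, predicted_scores) if p >= cutoff]
    if not flagged:
        return float("nan")  # no predicted positives, so PPV is undefined
    return sum(t >= cutoff for t in flagged) / len(flagged)


# Made-up toy scores for illustration only; these are not data from the study.
true_phq9 = [0, 2, 12, 15, 3, 0, 11, 1]
predicted_phq9 = [1, 10, 11, 9, 2, 0, 14, 12]

r_squared = pearson_r(true_phq9, predicted_phq9) ** 2
toy_ppv = ppv(true_phq9, predicted_phq9)
print(f"r^2 = {r_squared:.3f}, PPV = {toy_ppv:.3f}")
```

In the toy data, four notes are flagged by the model and two of them have a true score of 10 or more, so the PPV is 0.5. When few patients truly meet the cutoff, as with the 4.8 % reported in the Results, even a modest false-positive rate pulls PPV down, so PPV is best read alongside the base rate.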
Results
Mean age was 46.3 years (SD 14.9); 9083 (60.6 %) identified as female. 925 (6.2 %) identified as Asian, 638 (4.3 %) as Black, 853 (5.7 %) as another race, and 12,187 (81.2 %) as White. A total of 1044 (7.0 %) identified as Hispanic ethnicity, while 12,699 (84.7 %) were non-Hispanic. Mean measured PHQ-9 score was 1.23 (SD 3.45); 721 (4.8 %) met criteria for moderate or greater depressive symptoms. LLM-predicted PHQ-9 scores were modestly correlated with actual scores (r² = 0.264; 95 % CI 0.252–0.276); PPV for moderate or greater depression was 0.309 (95 % CI 0.302–0.317). Performance was consistent across demographic subgroups, with modest differences identified by race, ethnicity, and sex.
Conclusion
A foundational LLM performed poorly but consistently across subgroups in imputing PHQ-9 scores from notes when actual PHQ-9 reporting was ablated. This result suggests the extent to which inclusion of PROMs may impoverish documentation of psychiatric symptoms.
About the journal:
The Journal of Affective Disorders publishes papers concerned with affective disorders in the widest sense: depression, mania, mood spectrum, emotions and personality, anxiety and stress. It is interdisciplinary and aims to bring together different approaches for a diverse readership. Top quality papers will be accepted dealing with any aspect of affective disorders, including neuroimaging, cognitive neurosciences, genetics, molecular biology, experimental and clinical neurosciences, pharmacology, neuroimmunoendocrinology, intervention and treatment trials.