Assessment of a zero-shot large language model in measuring documented goals-of-care discussions.

Robert Y Lee, Kevin S Li, James Sibley, Trevor Cohen, William B Lober, Danae G Dotolo, Erin K Kross

Journal of Pain and Symptom Management. Published 2025-10-06. DOI: 10.1016/j.jpainsymman.2025.09.025
Abstract
Context: Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying such documentation require costly task-specific training data. Large language models (LLMs) hold promise for measuring such constructs with little or no task-specific training data.
Objective: To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) for identifying documented GOC discussions.
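To make the zero-shot setup concrete, the minimal sketch below shows how a single clinical note might be classified by prompt alone. The prompt wording, the `generate` callable, and the yes/no parsing are illustrative assumptions, not the study's actual prompt or deployment.

```python
# Hypothetical zero-shot classification sketch. The prompt text and the
# `generate` callable are assumptions for illustration only.

PROMPT_TEMPLATE = """You are reviewing a clinical note for a hospitalized
patient with chronic life-limiting illness.

Question: Does this note document a goals-of-care discussion (a conversation
about the patient's goals, values, or preferences for future medical care)?
Answer with a single word: Yes or No.

Note:
{note_text}
"""


def classify_note_zero_shot(note_text: str, generate) -> bool:
    """Label one note using a zero-shot LLM call.

    `generate` is any callable that sends a prompt string to an LLM
    (e.g., a locally hosted Llama 3.3) and returns its text reply;
    no task-specific training examples are involved.
    """
    reply = generate(PROMPT_TEMPLATE.format(note_text=note_text))
    return reply.strip().lower().startswith("yes")
```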
Methods: We compared the performance of two NLP models in identifying documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on 4,642 manually annotated notes. We tested both models on records from a series of clinical trials enrolling adult patients with chronic life-limiting illness hospitalized during 2018-2023. We evaluated the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F1 score for both note-level and patient-level classification over a 30-day period.
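For reference, the reported metrics can be computed with standard scikit-learn calls, as in the self-contained sketch below; the toy labels and scores are invented for illustration and are not study data. Maximal F1 is taken as the best F1 over all probability thresholds on the precision-recall curve.

```python
# Sketch of the reported evaluation metrics: AUC, AUPRC, and maximal F1.
# Toy data below is illustrative, not from the study.
import numpy as np
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
)

# y_true: gold-standard labels (1 = GOC documentation present);
# y_score: model probabilities/scores for the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_score = np.array([0.10, 0.80, 0.90, 0.20, 0.70, 0.75, 0.30, 0.05])

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the PR curve

# Maximal F1: sweep every threshold on the precision-recall curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
max_f1 = f1.max()

print(f"AUC={auc:.3f}  AUPRC={auprc:.3f}  max F1={max_f1:.3f}")
```

For patient-level classification, one plausible (assumed) aggregation is to score each patient by the maximum note-level score across that patient's notes in the 30-day window; the abstract does not specify the aggregation actually used.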
Results: In our text corpora, GOC documentation represented <1% of text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation (Llama 3.3: AUC 0.979, AUPRC 0.873, and F1 0.83; BERT: AUC 0.981, AUPRC 0.874, and F1 0.83).
Conclusion: A zero-shot large language model with no task-specific training performed similarly to a task-specific trained BERT model in identifying documented goals-of-care discussions. This demonstrates the promise of LLMs in measuring novel clinical research outcomes.
Journal Introduction:
The Journal of Pain and Symptom Management is an internationally respected, peer-reviewed journal and serves an interdisciplinary audience of professionals by providing a forum for the publication of the latest clinical research and best practices related to the relief of illness burden among patients afflicted with serious or life-threatening illness.