Assessment of a zero-shot large language model in measuring documented goals-of-care discussions.

Robert Y Lee, Kevin S Li, James Sibley, Trevor Cohen, William B Lober, Danae G Dotolo, Erin K Kross

Journal of Pain and Symptom Management. Published 2025-10-06. DOI: 10.1016/j.jpainsymman.2025.09.025
Abstract
Context: Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying such documentation require costly task-specific training data. Large language models (LLMs) hold promise for measuring such constructs with little or no task-specific training data.
Objective: To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) for identifying documented GOC discussions.
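To make the zero-shot setup concrete, the minimal sketch below shows how a single clinical note might be classified by prompt alone. The prompt wording, the `generate` callable, and the yes/no parsing are illustrative assumptions, not the study's actual prompt or deployment.

```python
# Hypothetical zero-shot classification sketch. The prompt text and the
# `generate` callable are assumptions for illustration only.

PROMPT_TEMPLATE = """You are reviewing a clinical note for a hospitalized
patient with chronic life-limiting illness.

Question: Does this note document a goals-of-care discussion (a conversation
about the patient's goals, values, or preferences for future medical care)?
Answer with a single word: Yes or No.

Note:
{note_text}
"""


def classify_note_zero_shot(note_text: str, generate) -> bool:
    """Label one note using a zero-shot LLM call.

    `generate` is any callable that sends a prompt string to an LLM
    (e.g., a locally hosted Llama 3.3) and returns its text reply;
    no task-specific training examples are involved.
    """
    reply = generate(PROMPT_TEMPLATE.format(note_text=note_text))
    return reply.strip().lower().startswith("yes")
```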
Methods: We compared the performance of two NLP models in identifying documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on 4,642 manually annotated notes. We tested both models on records from a series of clinical trials enrolling adult patients with chronic life-limiting illness hospitalized during 2018-2023. We evaluated the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F1 score for both note-level and patient-level classification over a 30-day period.
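For reference, the reported metrics can be computed with standard scikit-learn calls, as in the self-contained sketch below; the toy labels and scores are invented for illustration and are not study data. Maximal F1 is taken as the best F1 over all probability thresholds on the precision-recall curve.

```python
# Sketch of the reported evaluation metrics: AUC, AUPRC, and maximal F1.
# Toy data below is illustrative, not from the study.
import numpy as np
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
)

# y_true: gold-standard labels (1 = GOC documentation present);
# y_score: model probabilities/scores for the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_score = np.array([0.10, 0.80, 0.90, 0.20, 0.70, 0.75, 0.30, 0.05])

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the PR curve

# Maximal F1: sweep every threshold on the precision-recall curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
max_f1 = f1.max()

print(f"AUC={auc:.3f}  AUPRC={auprc:.3f}  max F1={max_f1:.3f}")
```

For patient-level classification, one plausible (assumed) aggregation is to score each patient by the maximum note-level score across that patient's notes in the 30-day window; the abstract does not specify the aggregation actually used.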
Results: In our text corpora, GOC documentation represented <1% of text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation (Llama 3.3: AUC 0.979, AUPRC 0.873, and F1 0.83; BERT: AUC 0.981, AUPRC 0.874, and F1 0.83).
Conclusion: A zero-shot large language model with no task-specific training performed similarly to a task-specific trained BERT model in identifying documented goals-of-care discussions. This demonstrates the promise of LLMs in measuring novel clinical research outcomes.
Journal Introduction:
The Journal of Pain and Symptom Management is an internationally respected, peer-reviewed journal and serves an interdisciplinary audience of professionals by providing a forum for the publication of the latest clinical research and best practices related to the relief of illness burden among patients afflicted with serious or life-threatening illness.