Assessment of a zero-shot large language model in measuring documented goals-of-care discussions
Robert Y Lee, Kevin S Li, James Sibley, Trevor Cohen, William B Lober, Danae G Dotolo, Erin K Kross
medRxiv: the preprint server for health sciences (2025). DOI: 10.1101/2025.05.23.25328115
Abstract
Context: Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying such documentation require costly task-specific training data. Large language models (LLMs) hold promise for measuring such constructs with fewer or no task-specific training data.
Objective: To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) for identifying documented GOC discussions.
Methods: We compared the performance of two NLP models in identifying documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on 4,642 manually annotated notes. We tested both models on records from a series of clinical trials enrolling adult patients with chronic life-limiting illness hospitalized during 2018-2023. We evaluated the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPRC), and the maximal F1 score for both note-level and patient-level classification over a 30-day period.
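As an illustration of the evaluation metrics named above, the following is a minimal sketch, not the authors' code, of how AUC, AUPRC, and the maximal F1 score might be computed with scikit-learn. It assumes note-level gold labels and model-assigned probabilities are available as arrays; the variable and function names (`labels`, `scores`, `evaluate`) are hypothetical, and AUPRC is estimated here via average precision.

```python
# Minimal sketch (not the study's implementation): note-level AUC, AUPRC, and maximal F1
# from binary gold labels and model-assigned probabilities.
# Assumes `labels` (0/1) and `scores` (probabilities) as NumPy arrays; names are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

def evaluate(labels: np.ndarray, scores: np.ndarray) -> dict:
    auc = roc_auc_score(labels, scores)              # area under the ROC curve
    auprc = average_precision_score(labels, scores)  # average precision as an AUPRC estimate
    precision, recall, _ = precision_recall_curve(labels, scores)
    # Maximal F1: harmonic mean of precision and recall, maximized over thresholds.
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return {"AUC": auc, "AUPRC": auprc, "max_F1": float(f1.max())}

# Toy example (illustrative data only):
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.20, 0.80, 0.10, 0.65, 0.90, 0.30, 0.55])
print(evaluate(labels, scores))
```

Note that other AUPRC estimators (e.g., trapezoidal interpolation of the precision-recall curve) can give slightly different values; the sketch above is only one common choice.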
Results: In our text corpora, GOC documentation represented <1% of text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation (Llama 3.3: AUC 0.979, AUPRC 0.873, and F1 0.83; BERT: AUC 0.981, AUPRC 0.874, and F1 0.83).
Conclusion: A zero-shot large language model with no task-specific training performed similarly to a task-specific trained BERT model in identifying documented goals-of-care discussions. This demonstrates the promise of LLMs in measuring novel clinical research outcomes.
Key message: This article reports the performance of a publicly available large language model with no task-specific training data in measuring the occurrence of documented goals-of-care discussions from electronic health records. The study demonstrates that newer large language models may allow investigators to measure novel outcomes without requiring costly training data.