Assessment of a zero-shot large language model in measuring documented goals-of-care discussions.

IF 3.5 · CAS Medicine, Tier 2 · JCR Q2, Clinical Neurology
Robert Y Lee, Kevin S Li, James Sibley, Trevor Cohen, William B Lober, Danae G Dotolo, Erin K Kross
Citations: 0

Abstract


Context: Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying such documentation require costly task-specific training data. Large language models (LLMs) hold promise for measuring such constructs with fewer or no task-specific training data.

Objective: To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) for identifying documented GOC discussions.
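Zero-shot prompting means the model receives only task instructions, with no labeled examples. A minimal sketch of that setup is below; the prompt wording and answer-parsing rule are illustrative assumptions, not the authors' actual protocol.

```python
# Sketch of zero-shot note classification with an LLM.
# The prompt text and yes/no parsing convention are hypothetical
# illustrations, not the configuration used in the study.

def build_prompt(note_text: str) -> str:
    """Assemble a zero-shot prompt: task instructions only, no examples."""
    return (
        "You are reviewing a clinical note.\n"
        "Question: Does this note document a goals-of-care discussion "
        "(e.g., conversations about prognosis, treatment preferences, or "
        "end-of-life wishes)? Answer 'yes' or 'no'.\n\n"
        f"Note:\n{note_text}\n\nAnswer:"
    )

def parse_answer(completion: str) -> bool:
    """Map the model's free-text completion onto a binary label."""
    return completion.strip().lower().startswith("yes")

prompt = build_prompt("Discussed code status with patient; she prefers DNR.")
print(parse_answer("Yes, the note documents a goals-of-care discussion."))
```

The prompt would be sent to the LLM (e.g., a locally hosted Llama 3.3) and the completion mapped to a label; the exact serving stack is not specified in the abstract.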

Methods: We compared the performance of two NLP models in identifying documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers) model trained on 4,642 manually annotated notes. We tested both models on records from a series of clinical trials enrolling adult patients with chronic life-limiting illness hospitalized during 2018-2023. We evaluated the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPRC), and the maximal F1 score for both note-level and patient-level classification over a 30-day period.
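The reported metrics can be computed from gold labels and model scores; AUC is the probability that a random positive outscores a random negative, and maximal F1 sweeps all decision thresholds (AUPRC is computed analogously over the precision-recall curve). The labels and scores below are toy values for illustration, not study data.

```python
# Toy gold labels and classifier scores (illustrative only, not study data).
y_true  = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.8, 0.15, 0.9, 0.6, 0.65, 0.7, 0.05]

def auc_roc(y_true, y_score):
    """AUC: probability a random positive outscores a random negative
    (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def max_f1(y_true, y_score):
    """Maximal F1 over all decision thresholds."""
    best = 0.0
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < t)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best

print(f"AUC={auc_roc(y_true, y_score):.3f}  max F1={max_f1(y_true, y_score):.3f}")
```

In practice these would be computed with library routines (e.g., scikit-learn); the hand-rolled versions here just make the definitions explicit.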

Results: In our text corpora, GOC documentation represented <1% of text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation (Llama 3.3: AUC 0.979, AUPRC 0.873, and F1 0.83; BERT: AUC 0.981, AUPRC 0.874, and F1 0.83).
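Patient-level classification over a window can be derived by aggregating note-level predictions. The abstract does not state the aggregation rule; one common convention, shown in this hypothetical sketch, is to take the maximum note score per patient.

```python
from collections import defaultdict

# Hypothetical (patient_id, note_score) pairs for notes within a 30-day
# window. Values are invented for illustration.
note_scores = [
    ("pt1", 0.10), ("pt1", 0.85),
    ("pt2", 0.20), ("pt2", 0.30),
    ("pt3", 0.95),
]

# Aggregate to patient level: a patient's score is the maximum over notes.
patient_scores = defaultdict(float)
for pid, score in note_scores:
    patient_scores[pid] = max(patient_scores[pid], score)

# A patient is flagged if any note in the window exceeds the threshold.
threshold = 0.5
flagged = sorted(pid for pid, s in patient_scores.items() if s >= threshold)
print(flagged)  # → ['pt1', 'pt3']
```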

Conclusion: A zero-shot large language model with no task-specific training performed similarly to a task-specific trained BERT model in identifying documented goals-of-care discussions. This demonstrates the promise of LLMs in measuring novel clinical research outcomes.

Source journal metrics:
CiteScore: 8.90
Self-citation rate: 6.40%
Annual publications: 821
Review turnaround: 26 days
Journal description: The Journal of Pain and Symptom Management is an internationally respected, peer-reviewed journal that serves an interdisciplinary audience of professionals by providing a forum for the publication of the latest clinical research and best practices related to the relief of illness burden among patients afflicted with serious or life-threatening illness.