Anny T H R Fenton, Natasha Charewycz, Zarwah Kanwal, Brigitte N Durieux, Katherine I Pollack, James A Tulsky, Alexi A Wright, Charlotta J Lindvall
{"title":"使用大型语言模型来分析临床遇到的症状讨论和建议。","authors":"Anny T H R Fenton, Natasha Charewycz, Zarwah Kanwal, Brigitte N Durieux, Katherine I Pollack, James A Tulsky, Alexi A Wright, Charlotta J Lindvall","doi":"10.1177/10966218251363802","DOIUrl":null,"url":null,"abstract":"<p><p><b><i>Background:</i></b> Patient-provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain. <b><i>Objectives:</i></b> Assess an LLM's ability to analyze patient-provider interactions. <b><i>Design:</i></b> Compare a human's and an LLM's codings of clinical encounter transcripts. <b><i>Setting/Subjects:</i></b> Two hundred and thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed by GPT4DFCI in our hospital's Health Insurance Portability and Accountability Act compliant infrastructure instance of GPT-4 (OpenAI). <b><i>Measurements:</i></b> Human and an LLM-coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and human and qualitatively classified disagreements about recommendations. <b><i>Results:</i></b> Interrater reliability indicated \"strong\" and \"moderate\" agreement levels across measures: Agreement was strongest for whether the symptom was discussed (<i>k =</i> 0.89), followed by who initiated the discussion (<i>k</i> = 0.82), and the recommendation provided (<i>k</i> = 0.78). The human and LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions, which we categorized into nine types of disagreements. <b><i>Conclusions:</i></b> Our results suggest that LLMs' abilities to analyze clinical encounters are equivalent to humans. Thus, using LLMs as a research tool may make it more feasible to analyze patient-provider interactions, which could have broader implications for assessing and improving care quality, care inequities, and provider communication.</p>","PeriodicalId":16656,"journal":{"name":"Journal of palliative medicine","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Using Large Language Models to Analyze Symptom Discussions and Recommendations in Clinical Encounters .\",\"authors\":\"Anny T H R Fenton, Natasha Charewycz, Zarwah Kanwal, Brigitte N Durieux, Katherine I Pollack, James A Tulsky, Alexi A Wright, Charlotta J Lindvall\",\"doi\":\"10.1177/10966218251363802\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b><i>Background:</i></b> Patient-provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain. <b><i>Objectives:</i></b> Assess an LLM's ability to analyze patient-provider interactions. <b><i>Design:</i></b> Compare a human's and an LLM's codings of clinical encounter transcripts. 
<b><i>Setting/Subjects:</i></b> Two hundred and thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed by GPT4DFCI in our hospital's Health Insurance Portability and Accountability Act compliant infrastructure instance of GPT-4 (OpenAI). <b><i>Measurements:</i></b> Human and an LLM-coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and human and qualitatively classified disagreements about recommendations. <b><i>Results:</i></b> Interrater reliability indicated \\\"strong\\\" and \\\"moderate\\\" agreement levels across measures: Agreement was strongest for whether the symptom was discussed (<i>k =</i> 0.89), followed by who initiated the discussion (<i>k</i> = 0.82), and the recommendation provided (<i>k</i> = 0.78). The human and LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions, which we categorized into nine types of disagreements. <b><i>Conclusions:</i></b> Our results suggest that LLMs' abilities to analyze clinical encounters are equivalent to humans. Thus, using LLMs as a research tool may make it more feasible to analyze patient-provider interactions, which could have broader implications for assessing and improving care quality, care inequities, and provider communication.</p>\",\"PeriodicalId\":16656,\"journal\":{\"name\":\"Journal of palliative medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of palliative medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/10966218251363802\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of palliative medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/10966218251363802","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Using Large Language Models to Analyze Symptom Discussions and Recommendations in Clinical Encounters.
Background: Patient-provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain.
Objectives: Assess an LLM's ability to analyze patient-provider interactions.
Design: Compare a human's and an LLM's codings of clinical encounter transcripts.
Setting/Subjects: Two hundred and thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed by GPT4DFCI, our hospital's Health Insurance Portability and Accountability Act-compliant infrastructure instance of GPT-4 (OpenAI).
Measurements: A human and an LLM coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and the human and qualitatively classified disagreements about recommendations.
Results: Interrater reliability indicated "strong" and "moderate" agreement across measures: agreement was strongest for whether the symptom was discussed (κ = 0.89), followed by who initiated the discussion (κ = 0.82) and the recommendation provided (κ = 0.78). The human and LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions; we categorized these disagreements into nine types.
Conclusions: Our results suggest that LLMs' ability to analyze clinical encounters is equivalent to that of humans. Thus, using LLMs as a research tool may make it more feasible to analyze patient-provider interactions, which could have broader implications for assessing and improving care quality, care equity, and provider communication.
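As a concrete illustration of the agreement statistic reported above, the Python sketch below computes Cohen's κ between a human's and an LLM's codings of whether a symptom was discussed. The codings shown are hypothetical, and scikit-learn's cohen_kappa_score is just one possible implementation; this is not the study's actual analysis code.

# Minimal sketch: Cohen's kappa between human and LLM codings.
# The codings below are hypothetical, not data from the study.
from sklearn.metrics import cohen_kappa_score

# 1 = symptom was discussed, 0 = not discussed;
# one entry per potential symptom discussion.
human_codes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
llm_codes   = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
# and p_e is the agreement expected by chance given each rater's marginals.
kappa = cohen_kappa_score(human_codes, llm_codes)
print(f"Cohen's kappa: {kappa:.2f}")

In the study, κ would be computed separately for each of the three measures (whether the symptom was discussed, who initiated the discussion, and the recommendation provided), yielding the three values reported in the Results.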
About the Journal:
Journal of Palliative Medicine is the premier peer-reviewed journal covering medical, psychosocial, policy, and legal issues in end-of-life care and relief of suffering for patients with intractable pain. The Journal presents essential information for professionals in hospice/palliative medicine, focusing on improving quality of life for patients and their families, and the latest developments in drug and non-drug treatments.
The companion biweekly eNewsletter, Briefings in Palliative Medicine, delivers the latest breaking news and information to keep clinicians and health care providers continuously updated.