Anny T H R Fenton, Natasha Charewycz, Zarwah Kanwal, Brigitte N Durieux, Katherine I Pollack, James A Tulsky, Alexi A Wright, Charlotta J Lindvall
{"title":"使用大型语言模型来分析临床遇到的症状讨论和建议。","authors":"Anny T H R Fenton, Natasha Charewycz, Zarwah Kanwal, Brigitte N Durieux, Katherine I Pollack, James A Tulsky, Alexi A Wright, Charlotta J Lindvall","doi":"10.1177/10966218251363802","DOIUrl":null,"url":null,"abstract":"<p><p><b><i>Background:</i></b> Patient-provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain. <b><i>Objectives:</i></b> Assess an LLM's ability to analyze patient-provider interactions. <b><i>Design:</i></b> Compare a human's and an LLM's codings of clinical encounter transcripts. <b><i>Setting/Subjects:</i></b> Two hundred and thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed by GPT4DFCI in our hospital's Health Insurance Portability and Accountability Act compliant infrastructure instance of GPT-4 (OpenAI). <b><i>Measurements:</i></b> Human and an LLM-coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and human and qualitatively classified disagreements about recommendations. <b><i>Results:</i></b> Interrater reliability indicated \"strong\" and \"moderate\" agreement levels across measures: Agreement was strongest for whether the symptom was discussed (<i>k =</i> 0.89), followed by who initiated the discussion (<i>k</i> = 0.82), and the recommendation provided (<i>k</i> = 0.78). The human and LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions, which we categorized into nine types of disagreements. <b><i>Conclusions:</i></b> Our results suggest that LLMs' abilities to analyze clinical encounters are equivalent to humans. Thus, using LLMs as a research tool may make it more feasible to analyze patient-provider interactions, which could have broader implications for assessing and improving care quality, care inequities, and provider communication.</p>","PeriodicalId":16656,"journal":{"name":"Journal of palliative medicine","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Using Large Language Models to Analyze Symptom Discussions and Recommendations in Clinical Encounters .\",\"authors\":\"Anny T H R Fenton, Natasha Charewycz, Zarwah Kanwal, Brigitte N Durieux, Katherine I Pollack, James A Tulsky, Alexi A Wright, Charlotta J Lindvall\",\"doi\":\"10.1177/10966218251363802\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b><i>Background:</i></b> Patient-provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain. <b><i>Objectives:</i></b> Assess an LLM's ability to analyze patient-provider interactions. <b><i>Design:</i></b> Compare a human's and an LLM's codings of clinical encounter transcripts. 
<b><i>Setting/Subjects:</i></b> Two hundred and thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed by GPT4DFCI in our hospital's Health Insurance Portability and Accountability Act compliant infrastructure instance of GPT-4 (OpenAI). <b><i>Measurements:</i></b> Human and an LLM-coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and human and qualitatively classified disagreements about recommendations. <b><i>Results:</i></b> Interrater reliability indicated \\\"strong\\\" and \\\"moderate\\\" agreement levels across measures: Agreement was strongest for whether the symptom was discussed (<i>k =</i> 0.89), followed by who initiated the discussion (<i>k</i> = 0.82), and the recommendation provided (<i>k</i> = 0.78). The human and LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions, which we categorized into nine types of disagreements. <b><i>Conclusions:</i></b> Our results suggest that LLMs' abilities to analyze clinical encounters are equivalent to humans. Thus, using LLMs as a research tool may make it more feasible to analyze patient-provider interactions, which could have broader implications for assessing and improving care quality, care inequities, and provider communication.</p>\",\"PeriodicalId\":16656,\"journal\":{\"name\":\"Journal of palliative medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of palliative medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/10966218251363802\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of palliative medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/10966218251363802","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Using Large Language Models to Analyze Symptom Discussions and Recommendations in Clinical Encounters.
Background: Patient-provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain.
Objectives: Assess an LLM's ability to analyze patient-provider interactions.
Design: Compare a human's and an LLM's codings of clinical encounter transcripts.
Setting/Subjects: Two hundred and thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed by GPT4DFCI, our hospital's Health Insurance Portability and Accountability Act-compliant infrastructure instance of GPT-4 (OpenAI).
Measurements: A human and an LLM coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and the human and qualitatively classified disagreements about recommendations.
Results: Interrater reliability indicated "strong" and "moderate" agreement across measures: agreement was strongest for whether the symptom was discussed (κ = 0.89), followed by who initiated the discussion (κ = 0.82) and the recommendation provided (κ = 0.78). The human and LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions; we categorized these disagreements into nine types.
Conclusions: Our results suggest that LLMs' ability to analyze clinical encounters is equivalent to that of humans. Thus, using LLMs as a research tool may make it more feasible to analyze patient-provider interactions, which could have broader implications for assessing and improving care quality, care equity, and provider communication.
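As a concrete illustration of the agreement statistic reported above, the Python sketch below computes Cohen's κ between a human's and an LLM's codings of whether a symptom was discussed. The codings shown are hypothetical, and scikit-learn's cohen_kappa_score is just one possible implementation; this is not the study's actual analysis code.

# Minimal sketch: Cohen's kappa between human and LLM codings.
# The codings below are hypothetical, not data from the study.
from sklearn.metrics import cohen_kappa_score

# 1 = symptom was discussed, 0 = not discussed;
# one entry per potential symptom discussion.
human_codes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
llm_codes   = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
# and p_e is the agreement expected by chance given each rater's marginals.
kappa = cohen_kappa_score(human_codes, llm_codes)
print(f"Cohen's kappa: {kappa:.2f}")

In the study, κ would be computed separately for each of the three measures (whether the symptom was discussed, who initiated the discussion, and the recommendation provided), yielding the three values reported in the Results.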
About the Journal:
Journal of Palliative Medicine is the premier peer-reviewed journal covering medical, psychosocial, policy, and legal issues in end-of-life care and relief of suffering for patients with intractable pain. The Journal presents essential information for professionals in hospice/palliative medicine, focusing on improving quality of life for patients and their families, and the latest developments in drug and non-drug treatments.
The companion biweekly eNewsletter, Briefings in Palliative Medicine, delivers the latest breaking news and information to keep clinicians and health care providers continuously updated.