Summarizing Online Patient Conversations Using Generative Language Models: Experimental and Comparative Study

Rakhi Asokkumar Subjagouri Nair, Matthias Hartung, Philipp Heinisch, Janik Jaskolski, Cornelius Starke-Knäusel, Susana Veríssimo, David Maria Schmidt, Philipp Cimiano

JMIR Medical Informatics, vol. 13, e62909. Published April 14, 2025. DOI: 10.2196/62909. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12038288/pdf/
Abstract
Background: Social media is acknowledged by regulatory bodies (eg, the Food and Drug Administration) as an important source of patient experience data for learning about patients' unmet needs, priorities, and preferences. However, current methods rely either on manual analysis, which does not scale, or on automatic processing, which yields mainly quantitative insights. Methods that automatically summarize texts and yield qualitative insights at scale are missing.
Objective: The objective of this study was to evaluate to what extent state-of-the-art large language models can appropriately summarize posts shared by patients in web-based forums and health communities. Specifically, the goal was to compare the performance of different language models and prompting strategies on the task of summarizing documents reflecting the experiences of individual patients.
Methods: In our experimental and comparative study, we applied 3 different language models (Flan-T5, Generative Pretrained Transformer 3 [GPT-3], and GPT-3.5) in combination with various prompting strategies to the task of summarizing posts from patients in online communities. The generated summaries were evaluated against 124 manually created summaries serving as a ground-truth reference. We used 2 standard metrics from the field of text generation, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and BERTScore, to compare the automatically generated summaries to the manually created reference summaries.
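To make the evaluation setup concrete, the sketch below shows how a generated summary can be scored against a manual reference with ROUGE and BERTScore. It assumes the open-source rouge-score and bert-score Python packages (the study does not name its tooling), and the texts are invented placeholders, not data from the study.

```python
# Minimal sketch: scoring one generated summary against a manual reference
# with ROUGE and BERTScore. Requires: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Placeholder texts -- not actual data from the study.
reference = "The patient reports persistent fatigue and asks about treatment options."
candidate = "The post describes ongoing fatigue and a question about possible treatments."

# ROUGE-1, ROUGE-2, and ROUGE-L (n-gram and longest-common-subsequence overlap).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
for name, result in rouge.items():
    print(f"{name}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")

# BERTScore compares contextual embeddings rather than surface n-gram overlap,
# so paraphrases can still score highly.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```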
Results: Among the large language models investigated with zero-shot prompting, GPT-3.5 outperformed the other models on both the ROUGE metrics and BERTScore. While zero-shot prompting is itself a reasonable strategy, GPT-3.5 combined with directional stimulus prompting in a 3-shot setting achieved the best overall results on these metrics. A manual inspection of the summaries produced by the best-performing method showed that they were accurate and plausible compared to the manual summaries.
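Directional stimulus prompting augments each in-context example with hint keywords that steer the model toward the desired content of the summary. The sketch below shows one plausible way to assemble such a 3-shot prompt; the template, example posts, and hints are hypothetical illustrations, not the study's actual prompts or data.

```python
# Hypothetical sketch of 3-shot directional stimulus prompting for patient-post
# summarization. Each example pairs a post with "hint" keywords (the directional
# stimulus) and a target summary. All texts are invented placeholders.
EXAMPLES = [
    {"post": "Been exhausted for weeks, can barely get through work.",
     "hint": "fatigue; impact on daily functioning",
     "summary": "The patient reports weeks of fatigue that interferes with work."},
    {"post": "New med helps the pain but I feel dizzy every morning.",
     "hint": "pain relief; dizziness as side effect",
     "summary": "The patient finds the new medication effective for pain but "
                "experiences morning dizziness."},
    {"post": "Anyone else scared before their first infusion? What to expect?",
     "hint": "anxiety; first infusion; seeking advice",
     "summary": "The patient expresses anxiety about a first infusion and asks "
                "the community what to expect."},
]

def build_prompt(examples, new_post, new_hint):
    """Assemble a few-shot prompt in which the hint acts as the directional stimulus."""
    parts = ["Summarize the patient's post. Use the hint keywords as guidance.\n"]
    for ex in examples:
        parts.append(f"Post: {ex['post']}\nHint: {ex['hint']}\nSummary: {ex['summary']}\n")
    parts.append(f"Post: {new_post}\nHint: {new_hint}\nSummary:")
    return "\n".join(parts)

prompt = build_prompt(
    EXAMPLES,
    "Started a new diet for my Crohn's, flare-ups seem less frequent.",
    "dietary change; reduced flare-ups",
)
print(prompt)  # this string would then be sent to the language model of choice
```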
Conclusions: Taken together, our results suggest that state-of-the-art pretrained language models are a valuable tool for deriving qualitative insights about the patient experience. Such insights support a better understanding of unmet needs, patient priorities, and how a disease impacts daily functioning and quality of life, and can inform processes aimed at improving health care delivery and at ensuring that drug development focuses on the actual priorities and unmet needs of patients. The key limitations of our work are the small data sample and the fact that the manual summaries were created by only 1 annotator. Furthermore, the results hold only for the examined models and prompting strategies and may not generalize to other models and strategies.
About the Journal
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It emphasizes applied, translational research and has a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it emphasizes applications for clinicians and health professionals rather than consumers/citizens (the focus of JMIR), publishes even faster, and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.