Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty.
{"title":"Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty.","authors":"Yi-Chen Chen, Sheng-Hsun Lee, Huan Sheu, Sheng-Hsuan Lin, Chih-Chien Hu, Shih-Chen Fu, Cheng-Pang Yang, Yu-Chih Lin","doi":"10.1186/s12911-025-03024-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts.</p><p><strong>Methods: </strong>Four leading LLMs-GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus-were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question-only), and one under role-playing prompting (instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons evaluated responses for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure for acceptability. Statistical analyses (Wilcoxon rank sum and Chi-squared tests; P < 0.05) were conducted to compare model performance.</p><p><strong>Results: </strong>ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons based on zero-shot prompting, ChatGPT-4 achieved significantly higher scores of both accuracy and comprehensiveness relative to Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002), and demonstrated higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude.</p><p><strong>Conclusions: </strong>This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly for ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4, with role-playing prompts, showed superior performance in terms of accuracy, comprehensiveness, and acceptability. 
Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice.</p><p><strong>Clinical trial number: </strong>Not applicable.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":"25 1","pages":"196"},"PeriodicalIF":3.3000,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12102839/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-025-03024-5","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Citations: 0
Abstract
Background: The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts.
Methods: Four leading LLMs (GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus) were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question only) and one under role-playing prompting (instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons rated each response for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure of acceptability. Statistical analyses (Wilcoxon rank-sum and Chi-squared tests; P < 0.05) were conducted to compare model performance.
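As a rough illustration of the two prompting conditions, the Python sketch below builds the zero-shot (question-only) and role-playing variants of a prompt. The example questions, the ROLE_PLAYING_PREFIX wording, and the build_prompts helper are illustrative assumptions; the study's exact prompts and API tooling are not given in the abstract.

```python
# Minimal sketch of the two prompting conditions (zero-shot vs. role-playing).
# The example FAQ texts and the role-playing wording below are hypothetical;
# they stand in for the ten standardized patient inquiries used in the study.

TKA_FAQS = [
    "How long is the recovery after total knee arthroplasty?",  # illustrative question
    "What are the main risks of total knee replacement surgery?",  # illustrative question
]

ROLE_PLAYING_PREFIX = (
    "You are an experienced orthopaedic surgeon specializing in total knee "
    "arthroplasty. Answer the following patient question."
)

def build_prompts(question: str) -> dict[str, str]:
    """Return the zero-shot (question-only) and role-playing variants of one prompt."""
    return {
        "zero_shot": question,
        "role_playing": f"{ROLE_PLAYING_PREFIX}\n\n{question}",
    }

# Example: generate both prompt variants for the first FAQ.
for condition, prompt in build_prompts(TKA_FAQS[0]).items():
    print(f"--- {condition} ---\n{prompt}\n")
```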
Results: ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, and 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons under zero-shot prompting, ChatGPT-4 achieved significantly higher scores for both accuracy and comprehensiveness relative to Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002), and demonstrated higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed that role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude.
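The statistical comparisons reported above (Wilcoxon rank-sum tests for the ordinal Likert ratings, Chi-squared tests for the binary acceptability outcome) could be run along the lines of the SciPy sketch below. The score vectors and the contingency table are made-up placeholders, not the study's data.

```python
# Illustrative sketch of the study's statistical tests using SciPy.
# All numbers below are placeholder values, not the published results.
from scipy.stats import ranksums, chi2_contingency

# Hypothetical 5-point Likert accuracy ratings
# (4 surgeons x 10 questions = 40 ratings per model/condition).
gpt4_zero_shot   = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4] * 4
gemini_zero_shot = [3, 3, 2, 4, 3, 3, 2, 4, 3, 3] * 4

# Between-model comparison of ordinal Likert scores: Wilcoxon rank-sum test.
stat, p_likert = ranksums(gpt4_zero_shot, gemini_zero_shot)
print(f"Accuracy: rank-sum statistic = {stat:.2f}, P = {p_likert:.4f}")

# Binary acceptability compared with a Chi-squared test on a 2x2 table:
# rows = models, columns = [acceptable, not acceptable].
table = [[31, 9],    # e.g. ChatGPT-4 (placeholder counts)
         [22, 18]]   # e.g. Claude 3 Opus (placeholder counts)
chi2, p_accept, dof, _ = chi2_contingency(table)
print(f"Acceptability: chi2 = {chi2:.2f}, df = {dof}, P = {p_accept:.4f}")
```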
Conclusions: This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly for ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4, with role-playing prompts, showed superior performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice.
Journal description:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.