A Generative Pretrained Transformer (GPT)-Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study.

IF 3.2 Q1 EDUCATION, SCIENTIFIC DISCIPLINES

JMIR Medical Education Pub Date : 2024-01-16 DOI:10.2196/53961

Friederike Holderried, Christian Stegemann-Philipps, Lea Herschbach, Julia-Astrid Moldt, Andrew Nevins, Jan Griewatz, Martin Holderried, Anne Herrmann-Werner, Teresa Festl-Wietek, Moritz Mahling

{"title":"A Generative Pretrained Transformer (GPT)-Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study.","authors":"Friederike Holderried, Christian Stegemann-Philipps, Lea Herschbach, Julia-Astrid Moldt, Andrew Nevins, Jan Griewatz, Martin Holderried, Anne Herrmann-Werner, Teresa Festl-Wietek, Moritz Mahling","doi":"10.2196/53961","DOIUrl":null,"url":null,"abstract":"Background: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions.Objective: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication.Methods: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ).Results: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points).Conclusions: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information.","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10828948/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/53961","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions.

Objective: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication.

Methods: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ).

Results: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points).

Conclusions: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information.

查看原文本刊更多论文

生成式预训练转换器（GPT）驱动的聊天机器人作为模拟病人练习病史采集：前瞻性混合方法研究。

背景：沟通是医务人员的核心能力，对患者安全至关重要。虽然医学课程强调沟通训练，但传统的形式，如真实或模拟的病人互动，会给人带来心理压力，而且重复性有限。最近出现的大型语言模型（LLM），如生成预训练转换器（GPT），为克服这些限制提供了机会：本研究旨在探索 GPT 驱动的聊天机器人练习历史记录的可行性，历史记录是交流的核心能力之一：方法：我们使用 GPT-3.5 开发了一个交互式聊天机器人界面，其中包括一个聊天机器人优化疾病脚本和一个行为组件。我们采用混合方法，邀请医学生自愿练习病史采集。为了确定 GPT 是否能作为模拟病人提供合适的答案，我们使用定量和定性方法记录并分析了对话。我们分析了问题和答案与所提供脚本的一致程度，以及答案在医学上的可信度。最后，学生们填写了聊天机器人可用性问卷（CUQ）：共有 28 名学生使用我们的聊天机器人进行了练习（平均年龄 23.4 岁，标准差 2.9 岁）。我们共记录了 826 个问答对（QAPs），每次对话的 QAPs 中位数为 27.5 个，94.7%（n=782）与病史采集有关。当脚本明确涵盖问题时（n=502，60.3%），GPT 提供的答案大多基于明确的脚本信息（n=471，94.4%）。对于脚本未涉及的问题（n=195，23.4%），GPT 的答案使用了 56.4% （n=110）的虚构信息。在可信度方面，860 份 QAP 中有 842 份（97.9%）被评为可信。在 14 个（2.1%）不合情理的答案中，GPT 提供的答案被评为 "符合社会需要"、"脱离角色身份"、"忽略脚本信息"、"不合逻辑推理 "和 "计算错误"。尽管出现了这些结果，但 CUQ 显示用户体验总体良好（77/100 分）：我们的数据表明，GPT 等 LLM 可以提供模拟患者体验，并产生良好的用户体验和大多数合理的答案。我们的分析表明，GPT 提供的答案要么使用了明确的脚本信息，要么基于现有信息，这可以理解为归纳推理。基于 GPT 的聊天机器人在某些情况下提供了不可信的信息，尽管这种情况很少见，但主要倾向于提供社会所需的信息，而不是医学上可信的信息。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

JMIR Medical Education Social Sciences-Education

CiteScore

6.90

自引率

5.60%

发文量

审稿时长

8 weeks