{"title":"Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study.","authors":"Hiromizu Takahashi, Kiyoshi Shikino, Takeshi Kondo, Akira Komori, Yuji Yamada, Mizue Saita, Toshio Naito","doi":"10.2196/59133","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored.</p><p><strong>Objective: </strong>This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings.</p><p><strong>Methods: </strong>Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases.</p><p><strong>Results: </strong>Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations.</p><p><strong>Conclusions: </strong>ChatGPT-4-generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. 
This study emphasizes ChatGPT-4's value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application.</p>","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11350316/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/59133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Background: Evaluating the accuracy and educational utility of artificial intelligence-generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored.
Objective: This study aimed to assess the educational utility of ChatGPT-4-generated clinical vignettes and their applicability in educational settings.
Methods: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated in Japanese by ChatGPT-4. The survey used 6 main question items to rate the quality and educational utility of the generated clinical vignettes: information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine who were experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians' experience. Thematic analysis of the qualitative feedback was performed to identify areas for improvement and to confirm the educational utility of the cases.
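The abstract does not include the authors' analysis code. As a rough illustration only, a minimal Python sketch of the three tests named above might look like the following; all data, variable names, and group splits here are invented, since the study's dataset is not part of this page.

```python
# Sketch of the statistical tests named in the Methods: chi-square,
# Mann-Whitney U, and linear regression. All data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_raters = 71  # number of responding physicians, per the Results

# Binary "information accuracy" judgments for two vignettes (1 = accurate).
case_a = rng.integers(0, 2, size=n_raters)
case_b = rng.integers(0, 2, size=n_raters)

# Chi-square test on the 2x2 table of accurate vs inaccurate counts.
table = np.array([[case_a.sum(), n_raters - case_a.sum()],
                  [case_b.sum(), n_raters - case_b.sum()]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)

# Mann-Whitney U test on 5-point Likert "educational usefulness" ratings.
likert_a = rng.integers(1, 6, size=n_raters)
likert_b = rng.integers(1, 6, size=n_raters)
u_stat, p_u = stats.mannwhitneyu(likert_a, likert_b)

# Linear regression of ratings on years of physician experience.
years_experience = rng.integers(1, 45, size=n_raters)
slope, intercept, r, p_reg, se = stats.linregress(years_experience,
                                                  likert_a.astype(float))

print(f"chi2 p={p_chi2:.3f}, U p={p_u:.3f}, regression p={p_reg:.3f}")
```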
Results: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented hospitals of diverse sizes throughout Japan. Most deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) satisfactory; these two items were rated on a binary scale. On a 5-point Likert scale, the mean scores were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty. Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in the generation of physical findings, the naturalness of the language, and the accuracy of medical terminology. The thematic analysis highlighted the need for clearer documentation, consistent clinical information, relevant content, and patient-centered case presentations.
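For context on how figures like "3.55 (95% CI 3.49-3.60)" and "P<.001 after Bonferroni correction" are typically obtained, here is a hedged sketch with invented ratings; it is not the authors' computation, only the standard t-based interval and a per-test alpha adjustment for 18 cases.

```python
# Sketch: 95% CI for a mean Likert rating, plus the Bonferroni-adjusted
# significance threshold for 18 cases. Ratings below are invented.
import numpy as np
from scipy import stats

ratings = np.array([4, 3, 4, 5, 3, 4, 3, 4, 4, 3], dtype=float)
mean = ratings.mean()
sem = stats.sem(ratings)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(ratings) - 1, loc=mean, scale=sem)
print(f"mean {mean:.2f}, 95% CI {low:.2f}-{high:.2f}")

# Bonferroni correction: with 18 per-case comparisons, each test is
# judged against alpha / n_tests rather than the nominal 0.05.
alpha, n_tests = 0.05, 18
print(f"Bonferroni-adjusted alpha: {alpha / n_tests:.4f}")
```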
Conclusions: ChatGPT-4-generated medical cases written in Japanese show considerable potential as medical education resources, with quality and accuracy judged adequate. Nevertheless, the precision and realism of case details need improvement. This study underscores ChatGPT-4's value as an adjunctive educational tool in medicine, one that requires expert oversight for optimal application.