{"title":"Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study.","authors":"Shuai Ming, Qingge Guo, Wenjun Cheng, Bo Lei","doi":"10.2196/52784","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>With the increasing application of large language models like ChatGPT in various industries, its potential in the medical domain, especially in standardized examinations, has become a focal point of research.</p><p><strong>Objective: </strong>The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).</p><p><strong>Methods: </strong>The CNMLE 2022 question set, consisting of 500 single-answer multiple choices questions, were reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model's accuracy and consistency.</p><p><strong>Results: </strong>GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.</p><p><strong>Conclusions: </strong>GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role insignificantly enhanced the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.</p>","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"10 ","pages":"e52784"},"PeriodicalIF":3.2000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11336778/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/52784","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Background: With the increasing application of large language models such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
Objective: The aim of this study was to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability on the Chinese National Medical Licensing Examination (CNMLE).
Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), whether the prompt designated a system role tailored to the relevant medical subspecialty, and repetition of queries to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and κ values were used to evaluate the model's accuracy and consistency.
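For illustration, the repeated-query protocol with an optional subspecialty system role might look like the following Python sketch using the current OpenAI client. The model identifiers, system-role phrasing, and answer-format instruction are assumptions for illustration; the paper does not publish its exact prompts or settings.

```python
# Minimal sketch of the repeated-query protocol described in Methods.
# Model names, prompts, and the 8-repeat default are assumptions, not the
# authors' actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_question(question: str, subspecialty: str | None, n_repeats: int = 8) -> list[str]:
    """Query the model n_repeats times, optionally with a subspecialty system role."""
    messages = []
    if subspecialty:
        # Hypothetical system-role phrasing; the paper only states that roles
        # were "tailored to medical subspecialties".
        messages.append({
            "role": "system",
            "content": f"You are an experienced physician specializing in {subspecialty}.",
        })
    messages.append({
        "role": "user",
        "content": question + "\n\nAnswer with the single letter of the best option.",
    })

    answers = []
    for _ in range(n_repeats):
        resp = client.chat.completions.create(
            model="gpt-4",  # or "gpt-3.5-turbo" for the GPT-3.5 arm
            messages=messages,
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers
```

Collecting the list of repeated answers per question makes it straightforward to score first-response accuracy against the 60% threshold and to measure response variability across repetitions.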
Results: GPT-4.0 achieved a passing accuracy of 72.7%, significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (by 0.3%-3.7%) and GPT-3.5 (by 1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively, although none of these changes reached statistical significance (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). On the first response, GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, whereas GPT-3.5 did so in 7 of 15.
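As a rough illustration of the headline accuracy comparison, the χ² test can be reconstructed from the reported pass rates as follows. The 2×2 contingency layout and the reconstructed counts are assumptions: the abstract reports only percentages, and accuracy may have been averaged over repetitions.

```python
# Hedged sketch of the accuracy comparison, not the authors' exact analysis.
# Counts are reconstructed from the reported percentages (72.7% and 54% of
# 500 questions), so they are approximations.
from scipy.stats import chi2_contingency

n = 500
correct = {"gpt-4.0": round(0.727 * n), "gpt-3.5": round(0.54 * n)}
table = [[c, n - c] for c in correct.values()]  # rows: model; cols: correct / incorrect
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p:.2g}")  # consistent with the reported P < .001
```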
Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role enhanced the model's reliability and answer coherence, though not to a statistically significant degree. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.