ChatGPT's Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini.

IF 3.2 Q1 EDUCATION, SCIENTIFIC DISCIPLINES

JMIR Medical Education Pub Date : 2025-03-05 DOI:10.2196/65108

Filipe Prazeres

{"title":"ChatGPT's Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini.","authors":"Filipe Prazeres","doi":"10.2196/65108","DOIUrl":null,"url":null,"abstract":"Background: Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.Objective: This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates.Methods: ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, \"Are you sure?\" after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A single-parameter t test compared the models' performance to human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05.Results: ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance.Conclusions: This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"11 ","pages":"e65108"},"PeriodicalIF":3.2000,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11902880/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/65108","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.

Objective: This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates.

Methods: ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, "Are you sure?" after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A single-parameter t test compared the models' performance to human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05.

Results: ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance.

Conclusions: This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.

查看原文本刊更多论文

ChatGPT在葡萄牙医学考题中的表现：ChatGPT-3.5 Turbo与ChatGPT- 40 Mini的对比分析

背景：ChatGPT的进步正在通过提供新的评估和学习工具来改变医学教育，有可能加强对医生的评估并提高教学效率。目的：评价ChatGPT-3.5 Turbo和chatgpt - 40 mini在解决欧洲葡萄牙医学考题(2023年全国专科培训资格考试；（PNA），并将它们的表现与人类候选人进行比较。方法：ChatGPT-3.5 Turbo于2024年7月18日参加第一部分考试（74题），chatgpt - 40 mini于2024年7月19日参加第二部分考试（74题）。每个模型都使用其自然语言处理能力生成答案。为了测试一致性，每个模型在给出答案后都会被问到：“你确定吗？”每个模型的第一和第二反应之间的差异分析使用连续性校正McNemar检验。单参数t检验比较了模型与人类候选人的表现。频率和百分比用于分类变量，均值和ci用于数值变量。结果显示，chatgpt - 40 mini在2023 PNA检查中达到65%（48/74）的准确率，超过ChatGPT-3.5 Turbo。chatgpt - 40 mini的表现优于医疗候选人，而ChatGPT-3.5 Turbo的表现更为温和。结论：本研究强调了ChatGPT模型在医学教育中的进步和潜力，强调了在教师监督下仔细实施和进一步研究的必要性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊