{"title":"Performance of ChatGPT-4 on the Nepalese Undergraduate Medical Licensing Examination: A Cross-Sectional Study.","authors":"Prajjwol Luitel, Sujan Paudel, Devansh Upadhya, Amit Yadav, Gehendra Jung Kunwar","doi":"10.1177/23821205251384836","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>ChatGPT has shown remarkable performance in medical licensing examinations such as the United States Medical Licensing Examination. However, limited research exists regarding its performance on national medical licensing exams in low-income countries. In Nepal, where nearly half of the candidates fail the national medical licensing exam, ChatGPT has the potential to contribute to medical education.</p><p><strong>Objective: </strong>To evaluate ChatGPT's (GPT-4) performance on the Nepal Medical Council Licensing Medical Examination (NMCLE).</p><p><strong>Methods: </strong>The NMCLE-May 2024 dataset, comprising 900 multiple-choice questions, was used to assess ChatGPT's performance. After excluding 8 questions that contained figures or were not compatible with text-only input, 892 questions were analyzed. Specific prompt, including a background description, question, and choices, was entered. The response generated by ChatGPT was compared taking responses from experienced clinicians as a reference. Descriptive statistics were used to present the results, and regression analysis was employed to determine the association between variables, including set, question type, pattern, and subject, and incorrect responses.</p><p><strong>Results: </strong>GPT-4 generated 783 correct responses in 892 questions, an accuracy rate of 87.8%. Incorrect responses were more likely with questions requiring logical reasoning (odds ratio 14.7, 95% confidence interval [CI] 8.94-24.16).</p><p><strong>Conclusions: </strong>ChatGPT-4 performs at a standard comparable to or above that of medical graduates on the Nepalese undergraduate medical licensing examination. Incorrect responses were mainly in questions requiring logical reasoning, underscoring the need for caution when relying on its outputs in the same. These findings are encouraging and highlight the need for further studies to evaluate its role as an educational resource in Nepalese medical education.</p>","PeriodicalId":45121,"journal":{"name":"Journal of Medical Education and Curricular Development","volume":"12 ","pages":"23821205251384836"},"PeriodicalIF":1.6000,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12501450/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Education and Curricular Development","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/23821205251384836","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Introduction: ChatGPT has shown remarkable performance in medical licensing examinations such as the United States Medical Licensing Examination. However, limited research exists regarding its performance on national medical licensing exams in low-income countries. In Nepal, where nearly half of the candidates fail the national medical licensing exam, ChatGPT has the potential to contribute to medical education.
Objective: To evaluate ChatGPT's (GPT-4) performance on the Nepal Medical Council Licensing Examination (NMCLE).
Methods: The NMCLE-May 2024 dataset, comprising 900 multiple-choice questions, was used to assess ChatGPT's performance. After excluding 8 questions that contained figures or were otherwise incompatible with text-only input, 892 questions were analyzed. For each question, a specific prompt consisting of a background description, the question stem, and the answer choices was entered. The responses generated by ChatGPT were compared against reference answers provided by experienced clinicians. Descriptive statistics were used to present the results, and regression analysis was employed to determine the association of incorrect responses with question set, question type, pattern, and subject.
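The abstract does not state whether the ChatGPT web interface or the API was used; the sketch below is only a hypothetical illustration of the described prompt structure (background description, question stem, and answer choices), assuming the OpenAI Python SDK. The model name, field layout, and helper function are assumptions for illustration, not the authors' protocol.

```python
# Hypothetical sketch of the prompt structure described in Methods:
# background description + question stem + labelled answer choices.
# Assumes the OpenAI Python SDK; the study may have used the web interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BACKGROUND = (
    "You are answering a multiple-choice question from the Nepal Medical "
    "Council Licensing Examination. Reply with the single best option "
    "(A, B, C, or D) followed by a one-line justification."
)

def ask_mcq(question: str, choices: dict[str, str]) -> str:
    """Send one MCQ (stem plus labelled choices) and return the raw reply."""
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    prompt = f"{BACKGROUND}\n\nQuestion: {question}\n{options}"
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier; the paper evaluated GPT-4
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```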
Results: GPT-4 answered 783 of 892 questions correctly, an accuracy of 87.8%. Incorrect responses were more likely for questions requiring logical reasoning (odds ratio 14.7, 95% confidence interval [CI] 8.94-24.16).
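As a sanity check, 783/892 ≈ 0.878, which matches the reported 87.8% accuracy. The reported odds ratio is consistent with a logistic regression of incorrect responses on question characteristics; the sketch below, using statsmodels with an assumed column layout, illustrates how such an estimate and its 95% CI could be obtained and is not the authors' code.

```python
# Hedged sketch of the association analysis: logistic regression of incorrect
# responses on question characteristics. Column names, coding, and the input
# file are assumptions, not the study's actual data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per question with
#   incorrect          1 if GPT-4 answered wrongly, else 0
#   logical_reasoning  1 if the question requires logical reasoning, else 0
#   subject, set_id    other covariates mentioned in the abstract
df = pd.read_csv("nmcle_may2024_responses.csv")  # hypothetical file

model = smf.logit(
    "incorrect ~ logical_reasoning + C(subject) + C(set_id)", data=df
).fit()

# Exponentiating the coefficient and its confidence bounds gives the odds
# ratio and 95% CI (the abstract reports OR 14.7, 95% CI 8.94-24.16).
odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())
print(odds_ratios["logical_reasoning"], conf_int.loc["logical_reasoning"])
```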
Conclusions: ChatGPT-4 performs at a standard comparable to or above that of medical graduates on the Nepalese undergraduate medical licensing examination. Incorrect responses occurred mainly on questions requiring logical reasoning, underscoring the need for caution when relying on its outputs for such questions. These findings are encouraging and highlight the need for further studies to evaluate its role as an educational resource in Nepalese medical education.