ChatGPT in Iranian medical licensing examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model

IF 4.1 Q1 HEALTH CARE SCIENCES & SERVICES

BMJ Health & Care Informatics. Pub Date: 2023-12-01. DOI: 10.1136/bmjhci-2023-100815

Manoochehr Ebrahimian, Behdad Behnam, Negin Ghayebi, Elham Sobhrakhshankhah


Abstract

Introduction: Large language models such as ChatGPT have gained popularity for their ability to generate comprehensive responses to human queries. In the field of medicine, ChatGPT has shown promise in applications ranging from diagnostics to decision-making. However, its performance in medical examinations and its comparison to random guessing have not been extensively studied.

Methods: This study aimed to evaluate the performance of ChatGPT in the preinternship examination, a comprehensive medical assessment for students in Iran. The examination consisted of 200 multiple-choice questions categorised into basic science evaluation, diagnosis and decision-making. GPT-4 was used, and the questions were translated to English. A statistical analysis was conducted to assess the performance of ChatGPT and also compare it with a random test group.

Results: The results showed that ChatGPT performed exceptionally well, with 68.5% of the questions answered correctly, significantly surpassing the pass mark of 45%. It exhibited superior performance in decision-making and successfully passed all specialties. Comparing ChatGPT to the random test group, ChatGPT’s performance was significantly higher, demonstrating its ability to provide more accurate responses and reasoning.

Conclusion: This study highlights the potential of ChatGPT in medical licensing examinations and its advantage over random guessing. However, it is important to note that ChatGPT still falls short of human physicians in terms of diagnostic accuracy and decision-making capabilities. Caution should be exercised when using ChatGPT, and its results should be verified by human experts to ensure patient safety and avoid potential errors in the medical field.

Data are available on reasonable request.
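The comparison with random guessing can be illustrated with a short calculation. The sketch below is not the authors' analysis (the abstract does not name the statistical test used). It takes the reported figures (200 questions, 68.5% correct, 45% pass mark) and assumes, purely for illustration, that each question has four options, so that a random guesser is right 25% of the time. It then computes the exact binomial probability that such a guesser would score at least as well as the reported result.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures reported in the abstract.
N_QUESTIONS = 200
REPORTED_ACCURACY = 0.685
PASS_MARK = 0.45

# ASSUMPTION (not stated in the abstract): four answer options per question,
# so a random guesser answers each question correctly with probability 0.25.
CHANCE_RATE = 0.25

n_correct = round(REPORTED_ACCURACY * N_QUESTIONS)   # questions answered correctly
pass_threshold = round(PASS_MARK * N_QUESTIONS)      # questions needed to pass

# Probability that pure guessing reaches the reported score or better.
p_value = binomial_upper_tail(n_correct, N_QUESTIONS, CHANCE_RATE)

print(f"correct answers: {n_correct} of {N_QUESTIONS}")
print(f"pass threshold:  {pass_threshold} of {N_QUESTIONS}")
print(f"P(random guesser scores >= {n_correct}): {p_value:.3e}")
```

Under these assumptions the reported score corresponds to 137 correct answers against a pass threshold of 90. A different number of options per question would change `CHANCE_RATE` and therefore the computed probability.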
