Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study.

IF 3.2 Q1 EDUCATION, SCIENTIFIC DISCIPLINES

JMIR Medical Education Pub Date : 2024-10-08 DOI:10.2196/56128

Anthony James Goodings, Sten Kajitani, Allison Chhor, Ahmad Albakri, Mila Pastrak, Megha Kodancha, Rowan Ives, Yoo Bin Lee, Kari Kajitani

{"title":"Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study.","authors":"Anthony James Goodings, Sten Kajitani, Allison Chhor, Ahmad Albakri, Mila Pastrak, Megha Kodancha, Rowan Ives, Yoo Bin Lee, Kari Kajitani","doi":"10.2196/56128","DOIUrl":null,"url":null,"abstract":"Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis.Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations.Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, \"AI Family Medicine Board Exam Taker,\" designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment.Results: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions.Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI.","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"10 ","pages":"e56128"},"PeriodicalIF":3.2000,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11479358/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/56128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis.

Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations.

Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment.

Results: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions.

Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI.

查看原文本刊更多论文

使用先进的人工智能学习和分析方法评估全科医学考试中的 ChatGPT-4：观察研究。

背景：本研究探讨了 ChatGPT-4 在通过美国全科医学委员会（ABFM）认证考试方面的能力。早期的人工智能（AI）模型在医学委员会考试中表现出局限性，本研究针对现有文献中的这一空白，评估了 ChatGPT-4 的增强功能和潜力，尤其是在文档分析和信息综合方面：主要目的是评估 ChatGPT-4 在获得大量备考资源和使用复杂数据分析的情况下，能否达到或超过全科医学考试合格分数线：在这项研究中，ChatGPT-4 被嵌入到一个专门的子环境 "人工智能全科医学委员会考试者 "中，该子环境的设计非常接近 ABFM 认证考试的条件。该子环境使人工智能能够访问和分析一系列相关的学习材料，包括一本主要的医学教科书和基于网络的补充资源。向人工智能提出了一系列 ABFM 类型的考试问题，反映了考试的典型广度和复杂性。重点是评估人工智能准确解释和回答这些问题的能力，在这个受控的子环境中利用其先进的数据处理和分析能力：在我们的研究中，我们对 ChatGPT-4 在 300 道 ABFM 考试练习题上的表现进行了量化评估。自定义机器人版本的人工智能正确回答率为 88.67%（95% CI 85.08%-92.25%），常规版本的正确回答率为 87.33%（95% CI 83.57%-91.10%）。包括 McNemar 检验（P=.45）在内的统计分析显示，两个版本的准确率没有显著差异。此外，错误类型分布的卡方检验（P=.32）显示，不同版本的错误模式没有明显差异。这些结果凸显了 ChatGPT-4 在可控条件下回答复杂体检问题时的高水平表现和一致性：本研究表明，ChatGPT-4，尤其是在配备了专业准备和在量身定制的子环境中运行时，在处理错综复杂的医学考试方面表现出了巨大的潜力。虽然它的表现与通过 ABFM 认证考试的预期标准相当，但人工智能技术和量身定制的培训方法的进一步提升可以将这些能力推向新的高度。这一探索为将 ChatGPT-4 等人工智能工具整合到医学教育和评估中开辟了道路，强调了在人工智能医学应用方面不断进步和专业培训的重要性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊