Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study.

Impact factor: 3.2 (JCR Q1, Education, Scientific Disciplines)

JMIR Medical Education. Publication date: 2024-09-16. DOI: 10.2196/56859

Soo-Hyuk Yoon, Seok Kyeong Oh, Byung Gun Lim, Ho-Jin Lee

Volume 10, article e56859. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443200/pdf/


Abstract

Background: ChatGPT has been tested in health care settings, including on the US Medical Licensing Examination and specialty exams, where it showed near-passing results. Its performance in anesthesiology has been assessed using English-language board examination questions; however, its effectiveness in Korea remains unexplored.

Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education.

Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using the in-training examinations administered to Korean anesthesiology residents over the past 5 years, each consisting of 100 questions per year. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess differences in GPT performance across languages, we compared GPT-4's problem-solving proficiency on the original Korean texts with its proficiency on their English translations.

Results: A total of 398 questions were analyzed. GPT-4 (67.8%) demonstrated significantly better overall performance than GPT-3.5 (37.2%) and CLOVA X (36.7%). However, GPT-3.5 and CLOVA X did not differ significantly in overall performance. Additionally, GPT-4 performed better on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001).
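The performance measure defined in the Methods (correct answers divided by questions) can be illustrated with a minimal Python sketch. The abstract reports only percentages and the total of 398 questions; the per-model correct-answer counts below are not reported and are back-calculated here as assumptions (each is the only whole number out of 398 that rounds to the stated percentage). The names `assumed_correct` and `accuracy_percent` are hypothetical, and the sketch does not reproduce the study's significance tests or confidence interval.

```python
TOTAL_QUESTIONS = 398  # reported in the abstract

# ASSUMPTION: these counts are not given in the abstract. They are
# back-calculated from the reported percentages and the 398-question total.
assumed_correct = {
    "GPT-4 (Korean)": 270,
    "GPT-4 (English)": 300,
    "GPT-3.5": 148,
    "CLOVA X": 146,
}


def accuracy_percent(correct, total=TOTAL_QUESTIONS):
    """Performance as defined in the Methods: correct answers / questions, in percent."""
    return round(100 * correct / total, 1)


accuracies = {name: accuracy_percent(n) for name, n in assumed_correct.items()}

# Gap between GPT-4 on English translations and on the original Korean text,
# computed from the assumed counts before rounding.
language_gap = round(
    100
    * (assumed_correct["GPT-4 (English)"] - assumed_correct["GPT-4 (Korean)"])
    / TOTAL_QUESTIONS,
    1,
)

for name, value in accuracies.items():
    print(f"{name}: {value}%")
print(f"English vs Korean difference for GPT-4: {language_gap} percentage points")
```

Computing the gap from counts rather than from the rounded percentages explains why the reported difference is 7.5% even though 75.4% minus 67.8% is 7.6%.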

Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings.
