Performance of a Large Language Model on Japanese Emergency Medicine Board Certification Examinations.

Impact factor 1.4; Zone 4 (Medicine); JCR Q2, MEDICINE, GENERAL & INTERNAL

Journal of Nippon Medical School. Pub Date: 2024-05-21. Epub Date: 2024-03-02. Pages: 155-161. DOI: 10.1272/jnms.JNMS.2024_91-205

Yutaka Igarashi, Kyoichi Nakahara, Tatsuya Norii, Nodoka Miyake, Takashi Tagami, Shoji Yokobori



Abstract

Background: Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear.

Methods: To evaluate the reliability of information provided by ChatGPT, the LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a 5-year period (2018-2022) and programmed to answer each question twice. Statistical analysis was used to assess agreement between the two responses.

Results: The LLM provided answers to 465 of the 475 text-based questions, with an overall correct response rate of 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions with images that were not explained to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was better for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two responses was substantial (kappa = 0.70). Factual errors accounted for 82% of the incorrectly answered questions.

Conclusion: An LLM performed satisfactorily on an emergency medicine board certification examination in Japanese and without images. However, factual errors in the responses highlight the need for physician oversight when using LLMs.
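The Results report agreement between the two response runs as a kappa statistic. The abstract does not say which kappa variant was used or whether it was computed on the chosen answer options or on correct/incorrect outcomes, so the following is only a minimal sketch of how such a figure could be obtained, assuming Cohen's kappa on the selected answer options. The function name `cohen_kappa` and the two answer lists are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of items given the same label in both runs.
    p_o = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal label frequencies,
    # summed over every label that appears in either run.
    c1, c2 = Counter(first), Counter(second)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        # Both runs used one identical label throughout; agreement is trivial.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up answer choices for ten questions (assumption, NOT study data).
run_1 = ["a", "b", "c", "d", "e", "a", "b", "c", "d", "e"]
run_2 = ["a", "b", "c", "d", "e", "a", "b", "c", "e", "d"]

kappa = cohen_kappa(run_1, run_2)
print(round(kappa, 2))  # 0.75
```

In this toy input the two runs agree on 8 of 10 questions (observed agreement 0.8), and with five equally frequent options the chance agreement is 0.2, giving (0.8 - 0.2) / (1 - 0.2) = 0.75. On the widely used Landis and Koch scale, values from 0.61 to 0.80 are labeled "substantial", which matches the wording used for the reported kappa of 0.70.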




Source journal

Journal of Nippon Medical School (MEDICINE, GENERAL & INTERNAL)

CiteScore: 1.80

Self-citation rate: 10.00%

Articles published: 118

About the journal: The international effort to understand, treat and control disease involves clinicians and researchers from many medical and biological science disciplines. The Journal of Nippon Medical School (JNMS) is the official journal of the Medical Association of Nippon Medical School and is dedicated to furthering international exchange of medical science experience and opinion. It provides an international forum for researchers in the fields of basic and clinical medicine to introduce, discuss and exchange their novel achievements in biomedical science and a platform for the worldwide dissemination and steering of biomedical knowledge for the benefit of human health and welfare. Properly reasoned discussions disciplined by appropriate references to existing bodies of knowledge or aimed at motivating the creation of such knowledge are the aim of the journal.