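Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination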

IF 3.7 | Zone 2 (Medicine) | Q2 (Computer Science, Information Systems)

International Journal of Medical Informatics, volume 193, Article 105673. Pub Date: 2024-10-28. DOI: 10.1016/j.ijmedinf.2024.105673

Mingxin Liu, Tsuyoshi Okuhara, Zhehao Dai, Wenbo Huang, Lin Gu, Hiroko Okada, Emi Furukawa, Takahiro Kiuchi


Citations: 0

Abstract



Study aims and objectives

This study aims to evaluate the accuracy of medical knowledge in the most advanced LLMs (GPT-4o, GPT-4, Gemini 1.5 Pro, and Claude 3 Opus) as of 2024. It is the first to evaluate these LLMs using a non-English medical licensing exam. The insights from this study will guide educators, policymakers, and technical experts in the effective use of AI in medical education and clinical diagnosis.

Method

The authors entered 790 questions from the Japanese National Medical Examination into the chat windows of the LLMs to obtain responses. Two authors independently assessed the correctness of each response. The authors analyzed the overall accuracy rates of the LLMs and compared their performance on image and non-image questions, questions of varying difficulty levels, general and clinical questions, and questions from different medical specialties. Additionally, the authors examined the correlation between the number of publications and the LLMs’ performance in different medical specialties.
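The category comparisons described in the Method amount to a grouped accuracy computation over graded answers. The snippet below is a minimal sketch of that idea, not the authors' analysis code: the record layout, the model names `model_a` and `model_b`, the category labels, and every graded answer are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical graded records: (model, category, is_correct).
# These rows are made-up placeholders, not data from the study.
records = [
    ("model_a", "image", True),
    ("model_a", "image", False),
    ("model_a", "non_image", True),
    ("model_a", "non_image", True),
    ("model_b", "image", False),
    ("model_b", "image", False),
    ("model_b", "non_image", True),
    ("model_b", "non_image", False),
]


def accuracy_by(rows, key):
    """Return {group: fraction correct}, grouping each row by key(row)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = key(row)
        total[group] += 1
        correct[group] += int(row[2])
    return {group: correct[group] / total[group] for group in total}


# Overall accuracy per model, then accuracy per (model, category) pair.
overall = accuracy_by(records, key=lambda r: r[0])
per_category = accuracy_by(records, key=lambda r: (r[0], r[1]))

for model, acc in sorted(overall.items()):
    print(f"{model}: overall accuracy {acc:.1%}")
for (model, category), acc in sorted(per_category.items()):
    print(f"{model} / {category}: {acc:.1%}")
```

The same `accuracy_by` helper could group by difficulty level or specialty by changing only the `key` function, which is why the grouping logic is kept separate from the records.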

Results

GPT-4o achieved the highest accuracy rate, 89.2%, and outperformed the other LLMs both overall and in each specific category. All four LLMs performed better on non-image questions than on image questions, with a 10% accuracy gap. They also performed better on easy questions than on normal and difficult ones. GPT-4o achieved a 95.0% accuracy rate on easy questions, marking it as an effective knowledge source for medical education. The four LLMs performed worst in the “Gastroenterology and Hepatology” specialty. There was a positive correlation between the number of publications and LLM performance in different specialties.
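To give a sense of the sampling uncertainty behind a single reported accuracy, the sketch below converts the 89.2% figure into an implied count of correct answers and a Wilson score interval. It rests on an assumption the abstract does not state explicitly: that the rate was computed over all 790 questions. The interval method is also our choice for illustration, not something the study reports.

```python
import math

N_QUESTIONS = 790          # question count given in the abstract
REPORTED_ACCURACY = 0.892  # GPT-4o overall accuracy given in the abstract

# Assumption: the reported rate is computed over all 790 questions.
implied_correct = round(REPORTED_ACCURACY * N_QUESTIONS)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(implied_correct, N_QUESTIONS)
print(f"implied correct answers: {implied_correct} of {N_QUESTIONS}")
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

The Wilson interval is used here rather than the simpler normal approximation because it behaves better for proportions close to 0 or 1.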

Conclusions

GPT-4o achieved an overall accuracy rate close to 90%, with 95.0% on easy questions, significantly outperforming the other LLMs. This indicates GPT-4o’s potential as a knowledge source for easy questions. Image-based questions and question difficulty significantly impact LLM accuracy. “Gastroenterology and Hepatology” is the specialty with the lowest performance. The LLMs’ performance across medical specialties correlates positively with the number of related publications.
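The relationship between publication counts and per-specialty accuracy can be checked with a rank correlation. The abstract does not say which coefficient the authors used, so Spearman's rho is an assumption here, and the specialty names, publication counts, and accuracies below are hypothetical placeholders rather than values from the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholders: one entry per specialty, not data from the study.
specialties = ["specialty_1", "specialty_2", "specialty_3", "specialty_4"]
publication_counts = [1200, 450, 3000, 800]
accuracies = [0.85, 0.80, 0.91, 0.78]

rho = spearman(publication_counts, accuracies)
print(f"Spearman rho across {len(specialties)} specialties: {rho:.2f}")
```

A rank-based coefficient is a reasonable default in this setting because publication counts tend to be heavily skewed, but a Pearson correlation on the raw values would be an equally valid reading of the abstract.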


Source journal

International Journal of Medical Informatics (Medicine; Computer Science, Information Systems)

CiteScore: 8.90

Self-citation rate: 4.10%

Articles published: 217

Review time: 42 days

About the journal: International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.; computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; educational computer-based programs pertaining to medical informatics or medicine in general; and organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.