评估增强放射学专业检查的大型语言模型：与人类表现的比较研究。

IF 3.8 2区医学 Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Academic Radiology Pub Date : 2025-05-27 DOI:10.1016/j.acra.2025.05.023

Hao-Yun Liu, Shyh-Jye Chen, Weichung Wang, Chung-Hsi Lee, Hsian-He Hsu, Shu-Huei Shen, Hong-Jen Chiou, Wen-Jeng Lee

{"title":"评估增强放射学专业检查的大型语言模型：与人类表现的比较研究。","authors":"Hao-Yun Liu, Shyh-Jye Chen, Weichung Wang, Chung-Hsi Lee, Hsian-He Hsu, Shu-Huei Shen, Hong-Jen Chiou, Wen-Jeng Lee","doi":"10.1016/j.acra.2025.05.023","DOIUrl":null,"url":null,"abstract":"Rationale and objectives: The radiology specialty examination assesses clinical decision-making, image interpretation, and diagnostic reasoning. With the expansion of medical knowledge, traditional test design faces challenges in maintaining accuracy and relevance. Large language models (LLMs) demonstrate potential in medical education. This study evaluates LLM performance in radiology specialty exams, explores their role in assessing question difficulty, and investigates their reasoning processes, aiming to develop a more objective and efficient framework for exam design.Materials and methods: This study compared the performance of LLMs and human examinees in a radiology specialty examination. Three LLMs (GPT-4o, o1-preview, and GPT-3.5-turbo-1106) were evaluated under zero-shot conditions. Exam accuracy, examinee accuracy, discrimination index, and point-biserial correlation were used to assess LLMs' ability to predict question difficulty and reasoning processes. The data provided by the Taiwan Radiological Society ensures comparability between AI and human performance.Results: As for accuracy, GPT-4o (88.0%) and o1-preview (90.9%) outperformed human examinees (76.3%), whereas GPT-3.5-turbo-1106 showed significantly lower accuracy (50.2%). Question difficulty analysis revealed that newer LLMs excel in solving complex questions, while GPT-3.5-turbo-1106 exhibited greater performance variability. Discrimination index and point-biserial Correlation analyses demonstrated that GPT-4o and o1-preview accurately identified key differentiating questions, closely mirroring human reasoning patterns. These findings suggest that advanced LLMs can assess medical examination difficulty, offering potential applications in exam standardization and question evaluation.Conclusion: This study evaluated the problem-solving capabilities of GPT-3.5-turbo-1106, GPT-4o, and o1-preview in a radiology specialty examination. LLMs should be utilized as tools for assessing exam question difficulty and assisting in the standardized development of medical examinations.","PeriodicalId":50928,"journal":{"name":"Academic Radiology","volume":" ","pages":""},"PeriodicalIF":3.8000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating Large Language Models for Enhancing Radiology Specialty Examination: A Comparative Study with Human Performance.\",\"authors\":\"Hao-Yun Liu, Shyh-Jye Chen, Weichung Wang, Chung-Hsi Lee, Hsian-He Hsu, Shu-Huei Shen, Hong-Jen Chiou, Wen-Jeng Lee\",\"doi\":\"10.1016/j.acra.2025.05.023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Rationale and objectives: The radiology specialty examination assesses clinical decision-making, image interpretation, and diagnostic reasoning. With the expansion of medical knowledge, traditional test design faces challenges in maintaining accuracy and relevance. Large language models (LLMs) demonstrate potential in medical education. This study evaluates LLM performance in radiology specialty exams, explores their role in assessing question difficulty, and investigates their reasoning processes, aiming to develop a more objective and efficient framework for exam design.Materials and methods: This study compared the performance of LLMs and human examinees in a radiology specialty examination. Three LLMs (GPT-4o, o1-preview, and GPT-3.5-turbo-1106) were evaluated under zero-shot conditions. Exam accuracy, examinee accuracy, discrimination index, and point-biserial correlation were used to assess LLMs' ability to predict question difficulty and reasoning processes. The data provided by the Taiwan Radiological Society ensures comparability between AI and human performance.Results: As for accuracy, GPT-4o (88.0%) and o1-preview (90.9%) outperformed human examinees (76.3%), whereas GPT-3.5-turbo-1106 showed significantly lower accuracy (50.2%). Question difficulty analysis revealed that newer LLMs excel in solving complex questions, while GPT-3.5-turbo-1106 exhibited greater performance variability. Discrimination index and point-biserial Correlation analyses demonstrated that GPT-4o and o1-preview accurately identified key differentiating questions, closely mirroring human reasoning patterns. These findings suggest that advanced LLMs can assess medical examination difficulty, offering potential applications in exam standardization and question evaluation.Conclusion: This study evaluated the problem-solving capabilities of GPT-3.5-turbo-1106, GPT-4o, and o1-preview in a radiology specialty examination. LLMs should be utilized as tools for assessing exam question difficulty and assisting in the standardized development of medical examinations.\",\"PeriodicalId\":50928,\"journal\":{\"name\":\"Academic Radiology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Academic Radiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1016/j.acra.2025.05.023\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Academic Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.acra.2025.05.023","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

引用次数: 0

摘要

理由和目的：放射学专业检查评估临床决策，图像解释和诊断推理。随着医学知识的扩展，传统的测试设计在保持准确性和相关性方面面临挑战。大型语言模型（llm）在医学教育中显示出潜力。本研究评估了LLM在放射学专业考试中的表现，探讨了他们在评估问题难度方面的作用，并调查了他们的推理过程，旨在为考试设计制定一个更客观、更有效的框架。材料和方法：本研究比较了llm和人类考生在放射学专业考试中的表现。三种llm （gpt - 40, 01 -preview和GPT-3.5-turbo-1106）在零射击条件下进行评估。采用考试准确性、考生准确性、判别指数和点双列相关性来评估法学硕士预测问题难度和推理过程的能力。台湾放射学会提供的数据确保了人工智能和人类表现之间的可比性。结果：在正确率方面，gpt - 40（88.0%）和01 -preview（90.9%）优于人类考生（76.3%），而GPT-3.5-turbo-1106的正确率明显低于人类考生（50.2%）。问题难度分析显示，新一代法学硕士在解决复杂问题方面表现出色，而GPT-3.5-turbo-1106表现出更大的性能变化。判别指数和点双列相关分析表明，gpt - 40和01 -preview能够准确识别关键的判别问题，与人类推理模式非常接近。研究结果提示，高级法学硕士可以评估医学考试难度，在考试标准化和试题评价方面具有潜在的应用价值。结论：本研究评估了GPT-3.5-turbo-1106、gpt - 40和o1-preview在放射学专业检查中的解决问题能力。法学硕士应作为评估考试题目难度和协助医学考试规范化发展的工具。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Evaluating Large Language Models for Enhancing Radiology Specialty Examination: A Comparative Study with Human Performance.

Rationale and objectives: The radiology specialty examination assesses clinical decision-making, image interpretation, and diagnostic reasoning. With the expansion of medical knowledge, traditional test design faces challenges in maintaining accuracy and relevance. Large language models (LLMs) demonstrate potential in medical education. This study evaluates LLM performance in radiology specialty exams, explores their role in assessing question difficulty, and investigates their reasoning processes, aiming to develop a more objective and efficient framework for exam design.

Materials and methods: This study compared the performance of LLMs and human examinees in a radiology specialty examination. Three LLMs (GPT-4o, o1-preview, and GPT-3.5-turbo-1106) were evaluated under zero-shot conditions. Exam accuracy, examinee accuracy, discrimination index, and point-biserial correlation were used to assess LLMs' ability to predict question difficulty and reasoning processes. The data provided by the Taiwan Radiological Society ensures comparability between AI and human performance.

Results: As for accuracy, GPT-4o (88.0%) and o1-preview (90.9%) outperformed human examinees (76.3%), whereas GPT-3.5-turbo-1106 showed significantly lower accuracy (50.2%). Question difficulty analysis revealed that newer LLMs excel in solving complex questions, while GPT-3.5-turbo-1106 exhibited greater performance variability. Discrimination index and point-biserial Correlation analyses demonstrated that GPT-4o and o1-preview accurately identified key differentiating questions, closely mirroring human reasoning patterns. These findings suggest that advanced LLMs can assess medical examination difficulty, offering potential applications in exam standardization and question evaluation.

Conclusion: This study evaluated the problem-solving capabilities of GPT-3.5-turbo-1106, GPT-4o, and o1-preview in a radiology specialty examination. LLMs should be utilized as tools for assessing exam question difficulty and assisting in the standardized development of medical examinations.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Academic Radiology 医学-核医学

CiteScore

7.60

自引率

10.40%

发文量

432

审稿时长

18 days

期刊介绍： Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.