{"title":"Comparative Analysis of LLMs' Performance On a Practice Radiography Certification Exam.","authors":"Kevin R Clark","doi":"","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To compare the performance of multiple large language models (LLMs) on a practice radiography certification exam.</p><p><strong>Method: </strong>Using an exploratory, nonexperimental approach, 200 multiple-choice question stems and options (correct answers and distractors) from a practice radiography certification exam were entered into 5 LLMs: ChatGPT (OpenAI), Claude (Anthropic), Copilot (Microsoft), Gemini (Google), and Perplexity (Perplexity AI). Responses were recorded as correct or incorrect, and overall accuracy rates were calculated for each LLM. McNemar tests determined if there were significant differences between accuracy rates. Performance also was evaluated and aggregated by content categories and subcategories.</p><p><strong>Results: </strong>ChatGPT had the highest overall accuracy of 83.5%, followed by Perplexity (78.9%), Copilot (78.0%), Gemini (75.0%), and Claude (71.0%). ChatGPT had a significantly higher accuracy rate than did Claude (P , .001) and Gemini (P 5 .02). Regarding content categories, ChatGPT was the only LLM to correctly answer all 38 patient care questions. In addition, ChatGPT had the highest number of correct responses in the areas of safety (38/48, 79.2%) and procedures (50/59, 84.7%). Copilot had the highest number of correct responses in the area of image production (43/55, 78.2%). ChatGPT also achieved superior accuracy in 4 of the 8 subcategories.</p><p><strong>Discussion: </strong>Findings from this study provide valuable insights into the performance of multiple LLMs in answering practice radiography certification exam questions. Although ChatGPT emerged as the most accurate LLM for this practice exam, caution should be exercised when using generative artificial intelligence (AI) models. Because LLMs can generate false and incorrect information, responses must be checked for accuracy, and the models should be corrected when inaccurate responses are given.</p><p><strong>Conclusion: </strong>Among the 5 LLMs compared in this study, ChatGPT was the most accurate model. As interest in generative AI continues to increase and new language applications become readily available, users should understand the limitations of LLMs and check responses for accuracy. Future research could include additional practice exams in other primary pathways, including magnetic resonance imaging, nuclear medicine technology, radiation therapy, and sonography.</p>","PeriodicalId":51772,"journal":{"name":"Radiologic Technology","volume":"96 5","pages":"334-342"},"PeriodicalIF":0.5000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiologic Technology","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Abstract
Purpose: To compare the performance of multiple large language models (LLMs) on a practice radiography certification exam.
Method: Using an exploratory, nonexperimental approach, 200 multiple-choice question stems and options (correct answers and distractors) from a practice radiography certification exam were entered into 5 LLMs: ChatGPT (OpenAI), Claude (Anthropic), Copilot (Microsoft), Gemini (Google), and Perplexity (Perplexity AI). Responses were recorded as correct or incorrect, and overall accuracy rates were calculated for each LLM. McNemar tests were used to determine whether accuracy rates differed significantly between models. Performance was also evaluated and aggregated by content category and subcategory.
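The scoring and pairwise-comparison workflow described above can be illustrated with a short Python sketch. This is a minimal, hypothetical example rather than the study's actual code: the file name, column layout, and use of pandas and statsmodels are assumptions.

```python
# Minimal sketch of the scoring and McNemar-test workflow described above.
# The file name, column names, and layout are assumptions for illustration
# only; they are not taken from the study's materials.
from itertools import combinations

import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical scoring sheet: one row per exam question, one 0/1 column per
# model indicating whether that model answered the question correctly.
scores = pd.read_csv("practice_exam_scores.csv")
models = ["ChatGPT", "Claude", "Copilot", "Gemini", "Perplexity"]

# Overall accuracy rate for each model.
print(scores[models].mean().round(3))

# Pairwise McNemar tests on the paired correct/incorrect outcomes.
for a, b in combinations(models, 2):
    crosstab = pd.crosstab(scores[a], scores[b])
    # Arrange as a 2x2 table: [[both correct, a only], [b only, both wrong]].
    table = crosstab.reindex(index=[1, 0], columns=[1, 0], fill_value=0).to_numpy()
    result = mcnemar(table, exact=True)
    print(f"{a} vs {b}: P = {result.pvalue:.3f}")
```

The McNemar test is appropriate here because the five models answered the same 200 questions, so their correct/incorrect outcomes are paired rather than independent.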
Results: ChatGPT had the highest overall accuracy of 83.5%, followed by Perplexity (78.9%), Copilot (78.0%), Gemini (75.0%), and Claude (71.0%). ChatGPT had a significantly higher accuracy rate than did Claude (P < .001) and Gemini (P = .02). Regarding content categories, ChatGPT was the only LLM to correctly answer all 38 patient care questions. In addition, ChatGPT had the highest number of correct responses in the areas of safety (38/48, 79.2%) and procedures (50/59, 84.7%). Copilot had the highest number of correct responses in the area of image production (43/55, 78.2%). ChatGPT also achieved superior accuracy in 4 of the 8 subcategories.
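Category-level tallies like those above can be computed from the same hypothetical scoring sheet. The "category" column and its labels below are assumed for illustration; they mirror the exam's content areas rather than the study's actual data file.

```python
# Sketch of per-category accuracy, assuming the scoring sheet also carries a
# "category" column (e.g., patient care, safety, image production, procedures).
import pandas as pd

scores = pd.read_csv("practice_exam_scores.csv")
models = ["ChatGPT", "Claude", "Copilot", "Gemini", "Perplexity"]

# Correct counts, question counts, and accuracy per content category,
# e.g., 38 of 48 safety questions correct -> 79.2%.
by_category = scores.groupby("category")[models].agg(["sum", "count", "mean"])
print(by_category.round(3))
```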
Discussion: Findings from this study provide valuable insights into the performance of multiple LLMs in answering practice radiography certification exam questions. Although ChatGPT emerged as the most accurate LLM on this practice exam, caution should be exercised when using generative artificial intelligence (AI) models. Because LLMs can generate false or inaccurate information, responses must be checked for accuracy, and the models should be corrected when inaccurate responses are given.
Conclusion: Among the 5 LLMs compared in this study, ChatGPT was the most accurate model. As interest in generative AI continues to increase and new language applications become readily available, users should understand the limitations of LLMs and check responses for accuracy. Future research could include additional practice exams in other primary pathways, including magnetic resonance imaging, nuclear medicine technology, radiation therapy, and sonography.
Journal description:
Radiologic Technology is an official scholarly journal of the American Society of Radiologic Technologists. Published continuously since 1929, it circulates to more than 145,000 readers worldwide. This award-winning bimonthly Journal covers all disciplines and specialties within medical imaging, including radiography, mammography, computed tomography, magnetic resonance imaging, nuclear medicine imaging, sonography and cardiovascular-interventional radiography. In addition to peer-reviewed research articles, Radi