Evaluating Large Language Models on Aerospace Medicine Principles.

IF 1.1 | CAS Tier 4 (Medicine) | JCR Q3: Public, Environmental & Occupational Health
Wilderness & Environmental Medicine | Pub Date: 2025-09-01 | Epub Date: 2025-04-28 | Pages: 44S-52S | DOI: 10.1177/10806032251330628
Kyle D Anderson, Cole A Davis, Shawn M Pickett, Michael S Pohlen
{"title":"Evaluating Large Language Models on Aerospace Medicine Principles.","authors":"Kyle D Anderson, Cole A Davis, Shawn M Pickett, Michael S Pohlen","doi":"10.1177/10806032251330628","DOIUrl":null,"url":null,"abstract":"<p><p>IntroductionLarge language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting.MethodTo better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions.ResultsWhen queried with 857 free-response questions from <i>Aerospace Medicine Boards Questions and Answers</i>, ChatGPT-4 had a mean reader score from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that potentially could be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions.ConclusionThere is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.</p>","PeriodicalId":49360,"journal":{"name":"Wilderness & Environmental Medicine","volume":" ","pages":"44S-52S"},"PeriodicalIF":1.1000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Wilderness & Environmental Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/10806032251330628","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/28 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting.

Methods: To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM, on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions.

Results: When queried with 857 free-response questions from Aerospace Medicine Boards Questions and Answers, ChatGPT-4 had a mean reader score from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that could potentially be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions.

Conclusion: There is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.
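
The study does not publish its implementation, but a minimal Python sketch of the kind of retrieval-augmented, board-style evaluation described above may help clarify the workflow. Everything here is an assumption for illustration: the word-overlap retriever, the `query_llm` placeholder (a stand-in for a call to ChatGPT-4, Gemini Advanced, or any other model endpoint), and the multiple-choice scorer are not the authors' code.

```python
# Illustrative sketch only; not the implementation used in the study.
from collections import Counter


def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank reference passages by naive word overlap with the question."""
    q_words = Counter(question.lower().split())

    def overlap(passage: str) -> int:
        return sum((Counter(passage.lower().split()) & q_words).values())

    return sorted(corpus, key=overlap, reverse=True)[:k]


def query_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client for the model under test."""
    raise NotImplementedError("connect this to an LLM endpoint of your choice")


def answer_with_rag(question: str, corpus: list[str]) -> str:
    """Prepend retrieved aerospace medicine reference passages to the prompt."""
    context = "\n".join(retrieve(question, corpus))
    prompt = (
        "Answer using the aerospace medicine reference material below.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return query_llm(prompt)


def score_multiple_choice(responses: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of board-style items answered correctly (e.g., 17 of 20 = 0.85)."""
    correct = sum(responses[q].strip().upper() == a.upper() for q, a in answer_key.items())
    return correct / len(answer_key)
```

Note that in the study the 857 free-response answers were graded by human readers on a 1-5 Likert scale rather than by string matching, so a scorer like the one above would apply only to the 20-item multiple-choice portion.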

Source Journal
Wilderness & Environmental Medicine (Medicine - Public, Environmental & Occupational Health)
CiteScore: 2.10
Self-citation rate: 7.10%
Articles published: 96
Review time: >12 weeks
Journal description: Wilderness & Environmental Medicine, the official journal of the Wilderness Medical Society, is the leading journal for physicians practicing medicine in austere environments. This quarterly journal features articles on all aspects of wilderness medicine, including high altitude and climbing, cold- and heat-related phenomena, natural environmental disasters, immersion and near-drowning, diving and barotrauma, hazardous plants/animals/insects/marine animals, animal attacks, search and rescue, ethical and legal issues, aeromedical transport, survival physiology, medicine in remote environments, travel medicine, operational medicine, and wilderness trauma management. It presents original research and clinical reports from scientists and practitioners around the globe. WEM invites submissions from authors who want to take advantage of our established publication's unique scope, wide readership, and international recognition in the field of wilderness medicine. Its readership is a diverse group of medical and outdoor professionals who choose WEM as their primary wilderness medical resource.