Performance of trauma-trained large language models on surgical assessment questions: A new approach in resource identification.

IF 3.2 | CAS Tier 2 (Medicine) | Q1 (SURGERY)
Surgery Pub Date: 2025-03-01 | Epub Date: 2024-09-23 | DOI: 10.1016/j.surg.2024.08.026
Arnav Mahajan, Andrew Tran, Esther S Tseng, John J Como, Kevin M El-Hayek, Prerna Ladha, Vanessa P Ho
{"title":"Performance of trauma-trained large language models on surgical assessment questions: A new approach in resource identification.","authors":"Arnav Mahajan, Andrew Tran, Esther S Tseng, John J Como, Kevin M El-Hayek, Prerna Ladha, Vanessa P Ho","doi":"10.1016/j.surg.2024.08.026","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models have successfully navigated simulated medical board examination questions. However, whether and how language models can be used in surgical education is less understood. Our study evaluates the efficacy of domain-specific large language models in curating study materials for surgical board style questions.</p><p><strong>Methods: </strong>We developed EAST-GPT and ACS-GPT, custom large language models with domain-specific knowledge from published guidelines from the Eastern Association of the Surgery of Trauma and the American College of Surgeons Trauma Quality Programs. EAST-GPT, ACS-GPT, and an untrained GPT-4 performance were assessed trauma-related questions from Surgical Education and Self-Assessment Program (18th edition). Large language models were asked to choose answers and provide answer rationales. Rationales were assessed against an educational framework with 5 domains: accuracy, relevance, comprehensiveness, evidence-base, and clarity.</p><p><strong>Results: </strong>Ninety guidelines trained EAST-GPT and 10 trained ACS-GPT. All large language models were tested on 62 trauma questions. EAST-GPT correctly answered 76%, whereas ACS-GPT answered 68% correctly. Both models outperformed ChatGPT-4 (P < .05), which answered 45% correctly. For reasoning, EAST-GPT achieved the gratest mean scores across all 5 educational framework metrics. ACS-GPT scored lower than ChatGPT-4 in comprehensiveness and evidence-base; however, these differences were not statistically significant.</p><p><strong>Conclusion: </strong>Our study presents a novel methodology in identifying test-preparation resources by training a large language model to answer board-style multiple choice questions. Both trained models outperformed ChatGPT-4, demonstrating its answers were accurate, relevant, and evidence-based. Potential implications of such AI integration into surgical education must be explored.</p>","PeriodicalId":22152,"journal":{"name":"Surgery","volume":" ","pages":"108793"},"PeriodicalIF":3.2000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.surg.2024.08.026","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/23 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"SURGERY","Score":null,"Total":0}
Citations: 0

Abstract

Background: Large language models have successfully navigated simulated medical board examination questions. However, whether and how language models can be used in surgical education is less well understood. Our study evaluates the efficacy of domain-specific large language models in curating study materials for surgical board-style questions.

Methods: We developed EAST-GPT and ACS-GPT, custom large language models with domain-specific knowledge drawn from published guidelines of the Eastern Association for the Surgery of Trauma and the American College of Surgeons Trauma Quality Programs. The performance of EAST-GPT, ACS-GPT, and an untrained GPT-4 was assessed on trauma-related questions from the Surgical Education and Self-Assessment Program (18th edition). The large language models were asked to choose answers and provide answer rationales. Rationales were assessed against an educational framework with 5 domains: accuracy, relevance, comprehensiveness, evidence base, and clarity.
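The abstract does not describe how the custom models were configured. As an illustration only, the minimal sketch below grounds a general-purpose model in guideline text at query time via the OpenAI Chat Completions API; the function name answer_mcq and the prompt wording are hypothetical stand-ins for the study's actual custom-GPT setup.

```python
# A minimal sketch of guideline-grounded multiple-choice answering, assuming
# the OpenAI Chat Completions API; the study's actual custom-GPT
# configuration is not described in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_mcq(guideline_text: str, question: str, choices: list[str]) -> str:
    """Pose a board-style MCQ to a guideline-grounded model."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            # Domain knowledge injected as system context, standing in for
            # the study's uploaded EAST/ACS guideline documents (assumption).
            {"role": "system",
             "content": "Answer using only these trauma guidelines:\n"
                        + guideline_text},
            {"role": "user",
             "content": f"{question}\n{options}\n"
                        "Choose one answer letter and explain your rationale."},
        ],
    )
    return response.choices[0].message.content
```

The returned rationale, not just the chosen letter, is what the study scored against its 5-domain educational framework.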

Results: EAST-GPT was trained on 90 guidelines and ACS-GPT on 10. All large language models were tested on the same 62 trauma questions. EAST-GPT answered 76% correctly, whereas ACS-GPT answered 68% correctly. Both models outperformed ChatGPT-4 (P < .05), which answered 45% correctly. For reasoning, EAST-GPT achieved the greatest mean scores across all 5 educational framework metrics. ACS-GPT scored lower than ChatGPT-4 in comprehensiveness and evidence base; however, these differences were not statistically significant.
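As a rough sanity check on the reported comparison, the sketch below back-calculates correct-answer counts from the stated percentages (76%, 68%, and 45% of 62 questions) and runs a 2x2 chi-square test; both the recovered counts and the choice of test are assumptions, since the abstract does not state which test the authors used.

```python
# Back-of-the-envelope significance check, assuming counts recovered from the
# reported percentages; the abstract does not name the authors' test.
from scipy.stats import chi2_contingency

N = 62  # trauma questions, as reported in the abstract
correct = {"EAST-GPT": 47, "ACS-GPT": 42, "ChatGPT-4": 28}  # ~76%, ~68%, ~45%

for model in ("EAST-GPT", "ACS-GPT"):
    # 2x2 table: rows = model vs ChatGPT-4, columns = correct vs incorrect
    table = [[correct[model], N - correct[model]],
             [correct["ChatGPT-4"], N - correct["ChatGPT-4"]]]
    chi2, p, _, _ = chi2_contingency(table)
    print(f"{model} vs ChatGPT-4: chi2 = {chi2:.2f}, p = {p:.4f}")
```

Because all models answered the same 62 questions, a paired test such as McNemar's would be more appropriate, but it requires per-question results that the abstract does not report.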

Conclusion: Our study presents a novel methodology for identifying test-preparation resources by training a large language model to answer board-style multiple-choice questions. Both trained models outperformed ChatGPT-4, demonstrating that their answers were accurate, relevant, and evidence-based. The potential implications of such AI integration into surgical education must be explored.

Source journal
Surgery (Medicine, Surgery)
CiteScore: 5.40
Self-citation rate: 5.30%
Annual publications: 687
Review time: 64 days
Journal description: For 66 years, Surgery has published practical, authoritative information about procedures, clinical advances, and major trends shaping general surgery. Each issue features original scientific contributions and clinical reports. Peer-reviewed articles cover topics in oncology, trauma, gastrointestinal, vascular, and transplantation surgery. The journal also publishes papers from the meetings of its sponsoring societies, the Society of University Surgeons, the Central Surgical Association, and the American Association of Endocrine Surgeons.