GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model.

Endoscopy International Open · IF 2.3 · Q3 (GASTROENTEROLOGY & HEPATOLOGY)
Pub Date: 2025-08-06 · eCollection Date: 2025-01-01 · DOI: 10.1055/a-2637-2163
Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan
{"title":"GastroGPT:开发和控制概念验证定制临床语言模型的测试。","authors":"Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan","doi":"10.1055/a-2637-2163","DOIUrl":null,"url":null,"abstract":"<p><strong>Background and study aims: </strong>Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.</p><p><strong>Methods: </strong>In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.</p><p><strong>Results: </strong>A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) compared with GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all <i>P</i> < 0.001). It outperformed comparators in six of seven tasks ( <i>P</i> < 0.05), except follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus general models (97.4-260.35) ( <i>P</i> < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators ( <i>P</i> < 0.001). Multivariate analysis revealed that model type significantly predicted performance ( <i>P</i> < 0.001).</p><p><strong>Conclusions: </strong>This study pioneered development and comparison of a specialty-specific, clinically-oriented AI model to general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential for tailored, task-focused AI models in medicine.</p>","PeriodicalId":11671,"journal":{"name":"Endoscopy International Open","volume":"13 ","pages":"a26372163"},"PeriodicalIF":2.3000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371664/pdf/","citationCount":"0","resultStr":"{\"title\":\"GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model.\",\"authors\":\"Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan\",\"doi\":\"10.1055/a-2637-2163\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background and study aims: </strong>Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. 
We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.</p><p><strong>Methods: </strong>In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.</p><p><strong>Results: </strong>A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) compared with GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all <i>P</i> < 0.001). It outperformed comparators in six of seven tasks ( <i>P</i> < 0.05), except follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus general models (97.4-260.35) ( <i>P</i> < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators ( <i>P</i> < 0.001). Multivariate analysis revealed that model type significantly predicted performance ( <i>P</i> < 0.001).</p><p><strong>Conclusions: </strong>This study pioneered development and comparison of a specialty-specific, clinically-oriented AI model to general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential for tailored, task-focused AI models in medicine.</p>\",\"PeriodicalId\":11671,\"journal\":{\"name\":\"Endoscopy International Open\",\"volume\":\"13 \",\"pages\":\"a26372163\"},\"PeriodicalIF\":2.3000,\"publicationDate\":\"2025-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371664/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Endoscopy International Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1055/a-2637-2163\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q3\",\"JCRName\":\"GASTROENTEROLOGY & HEPATOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Endoscopy International Open","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1055/a-2637-2163","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"GASTROENTEROLOGY & HEPATOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Background and study aims: Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.

Methods: In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.
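
The abstract does not specify which tests the "comprehensive statistical analyses" comprised, so the sketch below is purely illustrative of the overall comparison it reports: pooled 10-point Likert ratings per model, a mean-difference test, and a variance-equality test. Everything here is hypothetical: the ratings are synthetic draws parameterized by the means and SDs reported in Results, the eight-rater panel size is an inference (2,240 ratings ÷ 280 rated outputs), and Welch's t-test and Levene's test stand in for whatever analyses the authors actually ran.

```python
# Illustrative sketch only -- NOT the authors' analysis code.
# Synthetic blinded Likert ratings shaped (task, case, rater) per model,
# pooled and compared for mean difference and variance equality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

MODELS = ["GastroGPT", "GPT-4", "Bard", "Claude"]
N_TASKS, N_CASES, N_RATERS = 7, 10, 8  # 8 raters: assumption, 2240/(4*7*10)

# Synthetic ratings drawn from the means/SDs reported in Results,
# clipped to the 1-10 Likert scale.
PARAMS = [(8.1, 1.8), (5.2, 3.0), (5.7, 3.3), (7.0, 2.7)]
ratings = {
    m: np.clip(rng.normal(mu, sd, size=(N_TASKS, N_CASES, N_RATERS)), 1, 10)
    for m, (mu, sd) in zip(MODELS, PARAMS)
}

baseline = ratings["GastroGPT"].ravel()
for m in MODELS[1:]:
    other = ratings[m].ravel()
    # Welch's t-test (no equal-variance assumption) for mean difference;
    # Levene's test for the score-consistency (variance) comparison.
    t = stats.ttest_ind(baseline, other, equal_var=False)
    lev = stats.levene(baseline, other)
    print(f"{m:>8}: mean {other.mean():.2f} vs {baseline.mean():.2f}, "
          f"Welch p={t.pvalue:.2g}, Levene p={lev.pvalue:.2g}")
```

Welch's variant is used in the sketch rather than a pooled-variance t-test because unequal variances across models are themselves one of the reported findings.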

Results: A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) compared with GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed comparators in six of seven tasks (P < 0.05), the exception being follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus the general-purpose models (97.4–260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (P < 0.001). Multivariate analysis revealed that model type significantly predicted performance (P < 0.001).
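
As a consistency check on the reported count (an inference, not stated in the abstract): 4 models × 7 tasks × 10 cases = 280 rated outputs, and 2,240 ÷ 280 = 8 ratings per output, consistent with an eight-member panel in which every expert rated every output.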

Conclusions: This study pioneered the development of a specialty-specific, clinically oriented AI model and its comparison with general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential of tailored, task-focused AI models in medicine.
