Performance analysis of large language models ChatGPT-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study.

IF 3.5 · CAS Tier 4 (Medicine) · JCR Q2 (Medicine, Research & Experimental)
Zhiwu Lin, Yuanyuan Li, Min Wu, Hongmei Liu, Xiaoyang Song, Qian Yu, Guibao Xiao, Jiajun Xie
Journal: Clinical and Experimental Medicine, vol. 25, no. 1, p. 213
DOI: 10.1007/s10238-025-01743-7
Publication date: 2025-06-20 (Journal Article)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12181206/pdf/
Citations: 0

Abstract

This study aimed to compare the performance of three large language models (ChatGPT-4o, OpenAI O1, and OpenAI O3 mini) in delivering accurate and guideline-compliant recommendations for pneumonia management. By assessing both general and guideline-focused questions, the investigation sought to elucidate each model's strengths, limitations, and capacity to self-correct in response to expert feedback. Fifty pneumonia-related questions (30 general, 20 guideline-based) were posed to the three models. Ten infectious disease specialists independently scored responses for accuracy using a 5-point scale. The two chain-of-thought models (OpenAI O1 and OpenAI O3 mini) were further tested for self-correction when initially rated "poor," with re-evaluations conducted one week later to reduce recall bias. Statistical analyses included nonparametric tests, ANOVA, and Fleiss' Kappa for inter-rater reliability. OpenAI O1 achieved the highest overall accuracy, followed by OpenAI O3 mini; ChatGPT-4o scored lowest. For "poor" responses, O1 and O3 mini both significantly improved after targeted prompts, reflecting the advantages of chain-of-thought reasoning. ChatGPT-4o demonstrated limited gains upon re-prompting and provided more concise, but sometimes incomplete, information. OpenAI O1 and O3 mini offered superior guideline-aligned recommendations and benefited from self-correction capabilities, while ChatGPT-4o's direct-answer approach led to moderate or poor outcomes for complex pneumonia queries. Incorporating chain-of-thought mechanisms appears critical for refining clinical guidance. These findings suggest that advanced large language models can support pneumonia management by providing accurate, up-to-date information, particularly when equipped to iteratively refine their outputs in response to expert feedback.
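The abstract reports Fleiss' Kappa to quantify agreement among the ten specialist raters. A minimal sketch of how that statistic is computed is shown below; the rating matrix is purely illustrative and is not the study's actual data.

```python
# Fleiss' kappa for inter-rater agreement: kappa = (P_bar - P_e) / (1 - P_e),
# where P_bar is the mean per-item agreement and P_e is chance agreement.

def fleiss_kappa(ratings):
    """ratings: N x k matrix; ratings[i][j] = number of raters who
    assigned item i to category j. Every row must sum to the same
    number of raters n."""
    N = len(ratings)
    k = len(ratings[0])
    n = sum(ratings[0])  # raters per item

    # Per-item agreement: P_i = (sum_j m_ij^2 - n) / (n * (n - 1))
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Chance agreement: P_e = sum_j p_j^2, with p_j the overall share
    # of all ratings falling in category j
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 questions, each scored by 10 raters on the
# study's 5-point accuracy scale (columns = categories 1..5).
example = [
    [0, 0, 0, 2, 8],
    [0, 0, 1, 3, 6],
    [0, 1, 2, 4, 3],
    [0, 0, 0, 1, 9],
]
print(round(fleiss_kappa(example), 3))
```

Kappa ranges from below 0 (worse than chance) to 1 (perfect agreement); values are conventionally read against benchmarks such as Landis and Koch's.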

Source journal: Clinical and Experimental Medicine (Medicine: Research & Experimental)
CiteScore: 4.80
Self-citation rate: 2.20%
Articles per year: 159
Review time: 2.5 months
Journal description: Clinical and Experimental Medicine (CEM) is a multidisciplinary journal that aims to be a forum of scientific excellence and information exchange in relation to the basic and clinical features of the following fields: hematology, onco-hematology, oncology, virology, immunology, and rheumatology. The journal publishes reviews and editorials, experimental and preclinical studies, translational research, prospectively designed clinical trials, and epidemiological studies. Papers containing new clinical or experimental data that are likely to contribute to changes in clinical practice or the way in which a disease is thought about will be given priority due to their immediate importance. Case reports will be accepted on an exceptional basis only, and their submission is discouraged. The major criteria for publication are clarity, scientific soundness, and advances in knowledge. In compliance with the overwhelmingly prevailing request by the international scientific community, and with respect for eco-compatibility issues, CEM is now published exclusively online.