{"title":"Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study.","authors":"Zhiwu Lin, Yuanyuan Li, Min Wu, Hongmei Liu, Xiaoyang Song, Qian Yu, Guibao Xiao, Jiajun Xie","doi":"10.1007/s10238-025-01743-7","DOIUrl":null,"url":null,"abstract":"<p><p>This study aimed to compare the performance of three large language models (ChatGPT-4o, OpenAI O1, and OpenAI O3 mini) in delivering accurate and guideline compliant recommendations for pneumonia management. By assessing both general and guideline-focused questions, the investigation sought to elucidate each model's strengths, limitations, and capacity to self-correct in response to expert feedback. Fifty pneumonia-related questions (30 general, 20 guideline-based) were posed to the three models. Ten infectious disease specialists independently scored responses for accuracy using a 5-point scale. The two chain-of-thought models (OpenAI O1 and OpenAI O3 mini) were further tested for self-correction when initially rated \"poor,\" with re-evaluations conducted one week later to reduce recall bias. Statistical analyses included nonparametric tests, ANOVA, and Fleiss' Kappa for inter-rater reliability. OpenAI O1 achieved the highest overall accuracy, followed by OpenAI O3 mini; ChatGPT-4o scored lowest. For \"poor\" responses, O1 and O3 mini both significantly improved after targeted prompts, reflecting the advantages of chain-of-thought reasoning. ChatGPT-4o demonstrated limited gains upon re-prompting and provided more concise, but sometimes incomplete, information. OpenAI O1 and O3 mini offered superior guideline-aligned recommendations and benefited from self-correction capabilities, while ChatGPT-4o's direct-answer approach led to moderate or poor outcomes for complex pneumonia queries. Incorporating chain-of-thought mechanisms appears critical for refining clinical guidance. These findings suggest that advanced large language models can support pneumonia management by providing accurate, up-to-date information, particularly when equipped to iteratively refine their outputs in response to expert feedback.</p>","PeriodicalId":10337,"journal":{"name":"Clinical and Experimental Medicine","volume":"25 1","pages":"213"},"PeriodicalIF":3.5000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12181206/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical and Experimental Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s10238-025-01743-7","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICINE, RESEARCH & EXPERIMENTAL","Score":null,"Total":0}
Abstract
This study aimed to compare the performance of three large language models (ChatGPT-4o, OpenAI O1, and OpenAI O3 mini) in delivering accurate and guideline-compliant recommendations for pneumonia management. By assessing both general and guideline-focused questions, the investigation sought to elucidate each model's strengths, limitations, and capacity to self-correct in response to expert feedback. Fifty pneumonia-related questions (30 general, 20 guideline-based) were posed to the three models. Ten infectious disease specialists independently scored responses for accuracy using a 5-point scale. The two chain-of-thought models (OpenAI O1 and OpenAI O3 mini) were further tested for self-correction when an initial response was rated "poor," with re-evaluations conducted one week later to reduce recall bias. Statistical analyses included nonparametric tests, ANOVA, and Fleiss' Kappa for inter-rater reliability. OpenAI O1 achieved the highest overall accuracy, followed by OpenAI O3 mini; ChatGPT-4o scored lowest. For "poor" responses, O1 and O3 mini both improved significantly after targeted prompts, reflecting the advantages of chain-of-thought reasoning. ChatGPT-4o demonstrated limited gains upon re-prompting and provided more concise, but sometimes incomplete, information. OpenAI O1 and O3 mini offered superior guideline-aligned recommendations and benefited from self-correction capabilities, while ChatGPT-4o's direct-answer approach led to moderate or poor outcomes for complex pneumonia queries. Incorporating chain-of-thought mechanisms appears critical for refining clinical guidance. These findings suggest that advanced large language models can support pneumonia management by providing accurate, up-to-date information, particularly when equipped to iteratively refine their outputs in response to expert feedback.
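The abstract reports Fleiss' Kappa as the measure of agreement among the ten specialists who scored each response on a 5-point scale. As a rough illustration of how that statistic is computed from a rating matrix (a minimal sketch with simulated ratings; the data, random seed, and function name below are illustrative and do not come from the paper):

```python
# Minimal sketch of Fleiss' kappa for inter-rater agreement.
# Simulated data: 50 questions, 10 raters, 5 score categories (1-5).
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning item i to category j."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()  # raters per item (assumed constant)
    # Observed per-item agreement P_i, then its mean P_bar
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P_e from the marginal category proportions
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(50, 10))  # raw 1-5 scores, simulated
# Convert raw scores to per-item category counts (each row sums to 10 raters)
counts = np.stack([np.bincount(r, minlength=6)[1:] for r in ratings])
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")
```

With uniformly random ratings as above, kappa lands near zero; values approaching 1 indicate strong agreement among the raters.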
Journal description:
Clinical and Experimental Medicine (CEM) is a multidisciplinary journal that aims to be a forum of scientific excellence and information exchange on the basic and clinical aspects of the following fields: hematology, onco-hematology, oncology, virology, immunology, and rheumatology. The journal publishes reviews and editorials, experimental and preclinical studies, translational research, prospectively designed clinical trials, and epidemiological studies. Papers containing new clinical or experimental data that are likely to change clinical practice or the understanding of a disease will be given priority because of their immediate importance. Case reports will be accepted only on an exceptional basis, and their submission is discouraged. The major criteria for publication are clarity, scientific soundness, and advances in knowledge. In response to strong demand from the international scientific community, and out of consideration for environmental sustainability, CEM is now published exclusively online.