The Performance of ChatGPT-4.0 and ChatGPT-4omni on Answering Thyroid Question: A Multicenter Study
Siyin Guo BS, Genpeng Li MD, Juxiang Gou BS, Yanping Gong MD, Wanjun Zhao MD, Zhiqiang Li MS, Xianwei Yang MD, Zhenni Liu MS, Zhihui Li MD, Jianyong Lei MD
Journal of Surgical Research, Volume 313, Pages 500-508. Published 2025-07-23. DOI: 10.1016/j.jss.2025.06.066
https://www.sciencedirect.com/science/article/pii/S0022480425004044
Abstract
Introduction
Although ChatGPT-4.0 exhibits increasing potential in medical applications, its more recent version, ChatGPT-4omni, has not yet been evaluated for how well it responds to patient questions on thyroid health. In this study, the performance of ChatGPT-4.0 and ChatGPT-4omni in answering questions on the thyroid was examined.
Methods
To test the performance of ChatGPT-4.0 and ChatGPT-4omni, we first obtained 28 thyroid-related questions from the Huayitong app, a medical app officially released by West China Hospital of Sichuan University. We then added two interventional questions, for a total of 30 questions. On June 28, 2024, we entered these queries into ChatGPT-4.0 and ChatGPT-4omni in Chinese to generate 60 Chinese replies. Finally, from July 1 to 15, 2024, we asked 60 patients, 29 surgeons, and 37 nurses from 21 tertiary care units nationwide to rate the two models' responses on a 5-point Likert scale in terms of time, word count, response speed, accuracy, comprehensiveness, empathy, and satisfaction.
Results
Across the 30 questions, ChatGPT-4omni produced longer responses than ChatGPT-4.0 (750.50 [611.50–817.25] characters versus 437.30 [110.20] characters; P < 0.001), took less time to respond (20.68 [4.38] seconds versus 27.58 [7.22] seconds; P < 0.001), and generated text faster (34.26 [5.03] characters/second versus 15.69 [13.90–16.92] characters/second; P < 0.001). Patients, thyroid surgeons, and thyroid surgery nurses all rated the responses from ChatGPT-4omni as more accurate, comprehensive, empathetic, and satisfactory than those from ChatGPT-4.0 (all P values < 0.05).
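The response-speed figures in the Results are reported in characters per second, which is character count divided by response time. The following is a minimal Python sketch of that arithmetic and of two common summary formats. It assumes that a value with a single bracketed number is a mean [standard deviation] and that a value with a bracketed range is a median [interquartile range]; the abstract does not state this. The function names and the sample measurements are hypothetical and are not data from the study.

```python
from statistics import mean, median, quantiles, stdev


def response_speed(char_count: int, seconds: float) -> float:
    """Characters generated per second for a single reply."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return char_count / seconds


def mean_sd(values):
    """Return (mean, sample standard deviation)."""
    return mean(values), stdev(values)


def median_iqr(values):
    """Return (median, first quartile, third quartile)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Hypothetical (character count, seconds) pairs, NOT study data.
replies = [(400, 25.0), (450, 30.0), (500, 25.0)]
speeds = [response_speed(chars, secs) for chars, secs in replies]

print("speeds (characters/second):", speeds)
print("mean [SD]:", mean_sd(speeds))
print("median [Q1, Q3]:", median_iqr(speeds))
```

Which summary fits depends on the distribution of the measurements: a mean with standard deviation is usual for roughly symmetric data, and a median with interquartile range for skewed data.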
Conclusions
ChatGPT-4omni outperformed ChatGPT-4.0 in answering common thyroid-related questions. However, further study and optimization are needed to achieve an efficient integration of ChatGPT in clinical settings.
About the journal:
The Journal of Surgical Research: Clinical and Laboratory Investigation publishes original articles concerned with clinical and laboratory investigations relevant to surgical practice and teaching. The journal emphasizes reports of clinical investigations or fundamental research bearing directly on surgical management that will be of general interest to a broad range of surgeons and surgical researchers. The articles presented need not have been the products of surgeons or of surgical laboratories.
The Journal of Surgical Research also features review articles and special articles relating to educational, research, or social issues of interest to the academic surgical community.