Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot
Authors: Chun-Ru Lin, Yi-Jun Chen, Po-An Tsai, Wen-Yuan Hsieh, Sung Huang Laurent Tsai, Tsai-Sheng Fu, Po-Liang Lai, Jau-Yuan Chen
Journal: Archives of Osteoporosis, vol. 20, issue 1 (published 2025-09-08)
DOI: 10.1007/s11657-025-01587-4
URL: https://link.springer.com/article/10.1007/s11657-025-01587-4
Abstract
Summary
The study assesses the performance of AI models in evaluating postmenopausal osteoporosis. We found that ChatGPT-4o produced the most appropriate responses, highlighting the potential of AI to enhance clinical decision-making and improve patient care in osteoporosis management.
Purpose
The rise of artificial intelligence (AI) offers the potential for assisting clinical decisions. This study aims to assess the accuracy of various artificial intelligence models in providing recommendations for the diagnosis and treatment of postmenopausal osteoporosis.
Methods
Six AI models (ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Gemini, Gemini Advanced, and Copilot) were prompted with questions drawn from the 2020 American Association of Clinical Endocrinologists (AACE) guidelines for osteoporosis. Responses were classified as accurate if they did not contradict the clinical guidelines. Two additional categories, over-conclusive and insufficient, were created to further evaluate responses. A response was labeled over-conclusive if the model provided recommendations not specified in the guidelines, and insufficient if it failed to provide relevant information included in the guidelines. Chi-square tests were used to compare categorical outcomes among the AI models.
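The chi-square comparison described above can be sketched as follows. This is a minimal illustration of a Pearson chi-square test of independence on a models-by-categories table, not the authors' analysis code. The function name `chi_square_independence` and the counts in `hypothetical_table` are assumptions made up for the example; the abstract does not report per-category counts.

```python
from typing import Sequence, Tuple


def chi_square_independence(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and degrees of freedom
    for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# HYPOTHETICAL counts for two unnamed models over 42 questions each.
# Columns: accurate, over-conclusive, insufficient (illustrative only).
hypothetical_table = [
    [30, 8, 4],
    [20, 10, 12],
]

statistic, dof = chi_square_independence(hypothetical_table)
print(f"chi-square = {statistic:.3f}, degrees of freedom = {dof}")
```

The statistic would then be compared against the chi-square distribution with `dof` degrees of freedom to obtain a p-value; a statistics library can do this step, which is left out here to keep the sketch dependency-free.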
Results
A total of 42 clinical questions were evaluated. ChatGPT-4o achieved an accuracy of 88%, ChatGPT-3.5 57.1%, ChatGPT-4.0 64.3%, Gemini 45.2%, Gemini Advanced 57.1%, and Copilot 47.6% (p < 0.001).
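As a rough consistency check on the percentages above, the number of accurate responses per model can be inferred from the 42-question total. The counts below are an inference from the reported percentages, not figures stated in the abstract, and the names `reported_accuracy_pct` and `implied_accurate_count` are chosen for this sketch.

```python
TOTAL_QUESTIONS = 42

# Accuracy percentages as reported in the Results paragraph.
reported_accuracy_pct = {
    "ChatGPT-4o": 88.0,
    "ChatGPT-3.5": 57.1,
    "ChatGPT-4.0": 64.3,
    "Gemini": 45.2,
    "Gemini Advanced": 57.1,
    "Copilot": 47.6,
}


def implied_accurate_count(pct: float, total: int) -> int:
    """Nearest whole number of accurate responses implied by a percentage."""
    return round(pct * total / 100)


# ASSUMPTION: each percentage is (accurate responses / 42), rounded.
implied_counts = {
    model: implied_accurate_count(pct, TOTAL_QUESTIONS)
    for model, pct in reported_accuracy_pct.items()
}

for model, count in implied_counts.items():
    recomputed = 100 * count / TOTAL_QUESTIONS
    print(
        f"{model}: {count}/{TOTAL_QUESTIONS} = {recomputed:.1f}% "
        f"(reported {reported_accuracy_pct[model]}%)"
    )
```

Under this assumption every reported figure maps to a whole number of questions; the 88% for ChatGPT-4o corresponds to 37 of 42, which is 88.1% at one decimal place.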
Conclusions
The study reveals significant variation in response accuracy across the AI models, with ChatGPT-4o demonstrating the highest accuracy. Further research is needed to explore the broader applicability of AI in medical domains.
About the journal:
Archives of Osteoporosis is an international multidisciplinary journal which is a joint initiative of the International Osteoporosis Foundation and the National Osteoporosis Foundation of the USA. The journal will highlight the specificities of different regions around the world concerning epidemiology, reference values for bone density and bone metabolism, as well as clinical aspects of osteoporosis and other bone diseases.