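Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot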

IF 2.8 | Zone 3, Medicine | JCR Q2 | Endocrinology & Metabolism
Chun-Ru Lin, Yi-Jun Chen, Po-An Tsai, Wen-Yuan Hsieh, Sung Huang Laurent Tsai, Tsai-Sheng Fu, Po-Liang Lai, Jau-Yuan Chen
Archives of Osteoporosis, 20(1). Journal Article. Published 2025-09-08.
DOI: 10.1007/s11657-025-01587-4
https://link.springer.com/article/10.1007/s11657-025-01587-4
Citations: 0

Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot

Summary

The study assesses the performance of AI models in evaluating postmenopausal osteoporosis. We found that ChatGPT-4o produced the most appropriate responses, highlighting the potential of AI to enhance clinical decision-making and improve patient care in osteoporosis management.

Purpose

The rise of artificial intelligence (AI) offers the potential for assisting clinical decisions. This study aims to assess the accuracy of various artificial intelligence models in providing recommendations for the diagnosis and treatment of postmenopausal osteoporosis.

Methods

Using questions from the 2020 American Association of Clinical Endocrinologists (AACE) guidelines for osteoporosis, AI models including ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Gemini, Gemini Advanced, and Copilot were prompted. Responses were classified as accurate if they did not contradict the clinical guidelines. Two additional categories, over-conclusive and insufficient, were created to further evaluate responses. Over-conclusive was designated if AI models provided recommendations not specified in the guidelines, while insufficient indicated a failure to provide relevant information included in the guidelines. Chi-square tests were employed to compare categorical outcomes among different AI models.

Results

A total of 42 clinical questions were evaluated. ChatGPT-4o achieved an accuracy of 88%, ChatGPT-3.5 57.1%, ChatGPT-4.0 64.3%, Gemini 45.2%, Gemini Advanced 57.1%, and Copilot 47.6% (p < 0.001).
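
The reported percentages can be turned back into counts out of 42 and compared with a chi-square test of independence. The sketch below is illustrative only and rests on assumptions not stated in the abstract: it treats the outcome as binary (accurate versus not accurate), rounds each percentage to the nearest whole question, and uses a closed-form tail probability that holds specifically for 5 degrees of freedom (6 models, 2 outcomes). The authors may have run their test on a different table, for example one that includes the over-conclusive and insufficient categories. All function names here are hypothetical.

```python
import math

TOTAL_QUESTIONS = 42

# Accuracy percentages as reported in the Results paragraph.
reported_accuracy = {
    "ChatGPT-4o": 88.0,
    "ChatGPT-3.5": 57.1,
    "ChatGPT-4.0": 64.3,
    "Gemini": 45.2,
    "Gemini Advanced": 57.1,
    "Copilot": 47.6,
}


def accurate_counts(percentages, total):
    """Assumption: each percentage is a whole number of questions out of `total`."""
    return {model: round(pct * total / 100) for model, pct in percentages.items()}


def chi_square_statistic(counts, total):
    """Pearson chi-square for a (models x 2) table: accurate vs. not accurate."""
    pooled_rate = sum(counts.values()) / (len(counts) * total)
    expected_accurate = pooled_rate * total
    expected_other = total - expected_accurate
    stat = 0.0
    for observed in counts.values():
        stat += (observed - expected_accurate) ** 2 / expected_accurate
        stat += ((total - observed) - expected_other) ** 2 / expected_other
    return stat


def chi_square_sf_df5(x):
    """Upper-tail probability of a chi-square variable with exactly 5 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * (
        math.sqrt(x) + x ** 1.5 / 3
    )


counts = accurate_counts(reported_accuracy, TOTAL_QUESTIONS)
stat = chi_square_statistic(counts, TOTAL_QUESTIONS)
p_value = chi_square_sf_df5(stat)

print(counts)
print(f"chi2 = {stat:.2f}, df = 5, p = {p_value:.5f}")
```

Under these assumptions the statistic comes out near 20.9, which lies above the 0.001 critical value for 5 degrees of freedom and so agrees with the reported p < 0.001. This checks consistency with the abstract; it does not reproduce the authors' analysis.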

Conclusions

The study reveals significant variation in response accuracy across the AI models, with ChatGPT-4o demonstrating the highest accuracy. Further research is necessary to explore the broader applicability of AI in medicine.

Source journal
Archives of Osteoporosis (Endocrinology & Metabolism; Orthopedics)
CiteScore: 5.50
Self-citation rate: 10.00%
Articles published: 133
About the journal: Archives of Osteoporosis is an international multidisciplinary journal which is a joint initiative of the International Osteoporosis Foundation and the National Osteoporosis Foundation of the USA. The journal will highlight the specificities of different regions around the world concerning epidemiology, reference values for bone density and bone metabolism, as well as clinical aspects of osteoporosis and other bone diseases.