Comparative Analysis of AI Tools for Disseminating ADA 2025 Diabetes Care Standards: Implications for Cardiovascular Physicians

Impact Factor 3.0 · CAS Medicine, Tier 2 · JCR Q2, Endocrinology & Metabolism
Tengfei Zheng
Journal of Diabetes, vol. 17, no. 3. Published 2025-03-06. DOI: 10.1111/1753-0407.70072. Citations: 0.

Abstract

Artificial intelligence (AI) models are increasingly used in clinical practice, including medical education and the dissemination of updated clinical guidelines. In this study, we evaluated four AI tools—ChatGPT-4o, ChatGPT-o1, ChatGPT-o3Mini, and DeepSeek—to assess their ability to summarize the Standards of Care in Diabetes—2025 from the American Diabetes Association (ADA) for cardiovascular physicians in primary care settings [1].

Using a standardized prompt, we compared the AI-generated summaries across 10 key metrics: accuracy (alignment with ADA 2025 guidelines), completeness (inclusion of core topics such as glycemic targets, blood pressure management, lipid control, and pharmacologic strategies), clarity (readability and conciseness for cardiovascular physicians), clinical relevance (utility for real-world cardiovascular practice), consistency (logical coherence and uniformity in recommendations), evidence support (reference to supporting studies and ADA standards), ethics (neutral and evidence-based recommendations), timeliness (inclusion of the latest ADA updates), actionability (practical guidance for cardiovascular physicians), and fluency (professional language and structure). Each AI tool was rated on a 0–5 scale for each category, yielding a total possible score of 50 points. All summaries were anonymized to remove identifiers. Each model (ChatGPT-4o, ChatGPT-o1, ChatGPT-o3Mini, and DeepSeek) was then tasked with evaluating all four anonymized summaries, including its own output, using the 10 predefined metrics. For each model, the four scores assigned by the evaluators (including self-evaluation) were averaged to calculate the final score per metric.
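The cross-evaluation and averaging step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual workflow; the function name and data layout are assumptions.

```python
# Sketch of the cross-evaluation averaging described above (illustrative only):
# each of the four models scores all four anonymized summaries on 10 metrics,
# and each summary's final per-metric score is the mean of its ratings.

METRICS = ["accuracy", "completeness", "clarity", "clinical_relevance",
           "consistency", "evidence_support", "ethics", "timeliness",
           "actionability", "fluency"]

def aggregate_scores(ratings):
    """ratings: {evaluator: {summary: {metric: score on a 0-5 scale}}}.
    Returns {summary: {metric: mean score, ..., "total": 0-50 sum}}."""
    summaries = {s for per_eval in ratings.values() for s in per_eval}
    final = {}
    for summary in summaries:
        per_metric = {}
        for metric in METRICS:
            scores = [ratings[e][summary][metric] for e in ratings]
            per_metric[metric] = sum(scores) / len(scores)
        # Total score is the sum of the 10 averaged metrics (max 50).
        per_metric["total"] = sum(per_metric[m] for m in METRICS)
        final[summary] = per_metric
    return final
```

With four evaluators this yields one averaged 0–5 value per metric per summary, and a 0–50 total matching the scoring scheme in the text.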

Our evaluation showed that ChatGPT-o1 performed best (48.3/50), excelling in completeness (5.0), clinical relevance (5.0), and actionability (5.0), with comprehensive coverage of diabetes screening, cardiovascular risk assessment, hypertension/lipid management, and multidisciplinary collaboration (Table 1). However, its evidence support (4.0) required improvement. ChatGPT-4o (45.5/50) demonstrated strengths in clarity (4.8) and structure but had limitations in timeliness (4.5) and evidence support (3.3), as it failed to incorporate 2025 guideline updates and lacked specific research references. The free models, O3Mini (47.3/50) and DeepSeek (47.3/50), performed comparably to paid tools. O3Mini excelled in consistency (5.0) and CKD/heart failure monitoring, while DeepSeek prioritized concise cardiovascular risk management (clarity: 5.0). Both free models, however, scored lower in completeness (O3Mini: 4.8; DeepSeek: 4.5) and evidence support (O3Mini: 4.0; DeepSeek: 3.8), reflecting insufficient integration of 2025 updates and trial data (Table 1).
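Using only the total scores reported above, the ranking can be reproduced in a few lines (values are taken directly from the text; the variable names are illustrative):

```python
# Total scores (out of 50) for each model, as reported in the text / Table 1.
totals = {
    "ChatGPT-o1": 48.3,
    "ChatGPT-o3Mini": 47.3,
    "DeepSeek": 47.3,
    "ChatGPT-4o": 45.5,
}

# Rank models from highest to lowest total; Python's sort is stable,
# so the tied free models keep their insertion order.
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

This makes the headline result easy to verify: ChatGPT-o1 leads, the two free models tie, and ChatGPT-4o trails despite being a paid tool.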

Among the most critical takeaways for cardiovascular physicians were the importance of individualized glycemic targets, the use of SGLT2 inhibitors and GLP-1 receptor agonists for cardiovascular protection, and the necessity of multidisciplinary collaboration for diabetes management. However, while AI-generated summaries provide a convenient way to access guidelines, the lack of explicit reference to primary sources remains a limitation that requires human oversight.

Given the potential for AI to support clinical decision-making, integrating these models with validated medical sources and interactive decision-support systems could further enhance their utility [2]. Cost considerations are notable: DeepSeek and O3Mini are freely accessible, whereas ChatGPT-4o and ChatGPT-o1 require paid subscriptions. Despite its higher cost, ChatGPT-o1 outperformed all other tools, while ChatGPT-4o lagged behind the free models in total score, raising questions about its cost-effectiveness (Table 1). Future studies should explore the integration of AI-generated summaries with interactive decision-support systems to optimize patient care.

T.Z. conceived and designed the study, performed all analyses, interpreted the results, and wrote the manuscript. The author approved the final version of the manuscript and takes full responsibility for all aspects of the work.

The author has nothing to report.

The author declares no conflicts of interest.

Source Journal

Journal of Diabetes (Endocrinology & Metabolism)
CiteScore: 6.50
Self-citation rate: 2.20%
Articles per year: 94
Review time: >12 weeks

About the journal: Journal of Diabetes (JDB) devotes itself to diabetes research, therapeutics, and education. It aims to involve researchers and practitioners in a dialogue between East and West via all aspects of epidemiology, etiology, pathogenesis, management, complications, and prevention of diabetes, including the molecular, biochemical, and physiological aspects of diabetes. The Editorial team is international, with a unique mix of Asian and Western participation. The Editors welcome submissions in the form of original research articles, images, novel case reports and correspondence, and will solicit reviews, point-counterpoint, commentaries, editorials, news highlights, and educational content.