Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study.

IF 3.1 3区医学 Q2 MEDICAL INFORMATICS

JMIR Medical Informatics Pub Date : 2024-10-02 DOI:10.2196/63010

Takanobu Hirosawa, Yukinori Harada, Kazuki Tokumasu, Takahiro Ito, Tomoharu Suzuki, Taro Shimizu

{"title":"Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study.","authors":"Takanobu Hirosawa, Yukinori Harada, Kazuki Tokumasu, Takahiro Ito, Tomoharu Suzuki, Taro Shimizu","doi":"10.2196/63010","DOIUrl":null,"url":null,"abstract":"Background: Generative artificial intelligence (GAI) systems by Google have recently been updated from Bard to Gemini and Gemini Advanced as of December 2023. Gemini is a basic, free-to-use model after a user's login, while Gemini Advanced operates on a more advanced model requiring a fee-based subscription. These systems have the potential to enhance medical diagnostics. However, the impact of these updates on comprehensive diagnostic accuracy remains unknown.Objective: This study aimed to compare the accuracy of the differential diagnosis lists generated by Gemini Advanced, Gemini, and Bard across comprehensive medical fields using case report series.Methods: We identified a case report series with relevant final diagnoses published in the American Journal Case Reports from January 2022 to March 2023. After excluding nondiagnostic cases and patients aged 10 years and younger, we included the remaining case reports. After refining the case parts as case descriptions, we input the same case descriptions into Gemini Advanced, Gemini, and Bard to generate the top 10 differential diagnosis lists. In total, 2 expert physicians independently evaluated whether the final diagnosis was included in the lists and its ranking. Any discrepancies were resolved by another expert physician. Bonferroni correction was applied to adjust the P values for the number of comparisons among 3 GAI systems, setting the corrected significance level at P value <.02.Results: In total, 392 case reports were included. The inclusion rates of the final diagnosis within the top 10 differential diagnosis lists were 73% (286/392) for Gemini Advanced, 76.5% (300/392) for Gemini, and 68.6% (269/392) for Bard. The top diagnoses matched the final diagnoses in 31.6% (124/392) for Gemini Advanced, 42.6% (167/392) for Gemini, and 31.4% (123/392) for Bard. Gemini demonstrated higher diagnostic accuracy than Bard both within the top 10 differential diagnosis lists (P=.02) and as the top diagnosis (P=.001). In addition, Gemini Advanced achieved significantly lower accuracy than Gemini in identifying the most probable diagnosis (P=.002).Conclusions: The results of this study suggest that Gemini outperformed Bard in diagnostic accuracy following the model update. However, Gemini Advanced requires further refinement to optimize its performance for future artificial intelligence-enhanced diagnostics. These findings should be interpreted cautiously and considered primarily for research purposes, as these GAI systems have not been adjusted for medical diagnostics nor approved for clinical use.","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"12 ","pages":"e63010"},"PeriodicalIF":3.1000,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11483254/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/63010","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Generative artificial intelligence (GAI) systems by Google have recently been updated from Bard to Gemini and Gemini Advanced as of December 2023. Gemini is a basic, free-to-use model after a user's login, while Gemini Advanced operates on a more advanced model requiring a fee-based subscription. These systems have the potential to enhance medical diagnostics. However, the impact of these updates on comprehensive diagnostic accuracy remains unknown.

Objective: This study aimed to compare the accuracy of the differential diagnosis lists generated by Gemini Advanced, Gemini, and Bard across comprehensive medical fields using case report series.

Methods: We identified a case report series with relevant final diagnoses published in the American Journal Case Reports from January 2022 to March 2023. After excluding nondiagnostic cases and patients aged 10 years and younger, we included the remaining case reports. After refining the case parts as case descriptions, we input the same case descriptions into Gemini Advanced, Gemini, and Bard to generate the top 10 differential diagnosis lists. In total, 2 expert physicians independently evaluated whether the final diagnosis was included in the lists and its ranking. Any discrepancies were resolved by another expert physician. Bonferroni correction was applied to adjust the P values for the number of comparisons among 3 GAI systems, setting the corrected significance level at P value <.02.

Results: In total, 392 case reports were included. The inclusion rates of the final diagnosis within the top 10 differential diagnosis lists were 73% (286/392) for Gemini Advanced, 76.5% (300/392) for Gemini, and 68.6% (269/392) for Bard. The top diagnoses matched the final diagnoses in 31.6% (124/392) for Gemini Advanced, 42.6% (167/392) for Gemini, and 31.4% (123/392) for Bard. Gemini demonstrated higher diagnostic accuracy than Bard both within the top 10 differential diagnosis lists (P=.02) and as the top diagnosis (P=.001). In addition, Gemini Advanced achieved significantly lower accuracy than Gemini in identifying the most probable diagnosis (P=.002).

Conclusions: The results of this study suggest that Gemini outperformed Bard in diagnostic accuracy following the model update. However, Gemini Advanced requires further refinement to optimize its performance for future artificial intelligence-enhanced diagnostics. These findings should be interpreted cautiously and considered primarily for research purposes, as these GAI systems have not been adjusted for medical diagnostics nor approved for clinical use.

查看原文本刊更多论文

评估 Gemini Advanced、Gemini 和 Bard 为病例报告系列分析生成的鉴别诊断列表准确性的比较研究：横断面研究。

背景介绍最近，谷歌的生成式人工智能（GAI）系统从 Bard 升级为 Gemini 和 Gemini Advanced，截止日期为 2023 年 12 月。Gemini 是用户登录后免费使用的基本模式，而 Gemini Advanced 则是需要付费订阅的高级模式。这些系统具有提高医疗诊断水平的潜力。然而，这些更新对综合诊断准确性的影响仍是未知数：本研究旨在利用病例报告系列比较 Gemini Advanced、Gemini 和 Bard 生成的鉴别诊断列表在综合医疗领域的准确性：我们确定了 2022 年 1 月至 2023 年 3 月期间发表在《美国病例报告杂志》（American Journal Case Reports）上的具有相关最终诊断的病例报告系列。在排除非诊断性病例和年龄在 10 岁及以下的患者后，我们纳入了剩余的病例报告。将病例部分细化为病例描述后，我们将相同的病例描述输入到 Gemini Advanced、Gemini 和 Bard 中，生成前 10 位鉴别诊断列表。共有两名专家医师独立评估最终诊断是否包含在列表中及其排名。任何差异均由另一位专家医师解决。根据 3 个 GAI 系统之间的比较次数，对 P 值进行了 Bonferroni 校正，将校正后的显著性水平设定为 P 值结果：共纳入 392 份病例报告。最终诊断在前 10 个鉴别诊断列表中的纳入率分别为：Gemini Advanced 73%（286/392）、Gemini 76.5%（300/392）和 Bard 68.6%（269/392）。Gemini Advanced 有 31.6%（124/392）、Gemini 有 42.6%（167/392）和 Bard 有 31.4%（123/392）的最高诊断与最终诊断相吻合。在前 10 个鉴别诊断列表中（P=.02），Gemini 的诊断准确率高于 Bard（P=.001）。此外，在确定最可能的诊断方面，Gemini Advanced 的准确性明显低于 Gemini（P=.002）：本研究结果表明，模型更新后，双子座在诊断准确性方面优于巴德。然而，Gemini Advanced 还需要进一步改进，以优化其在未来人工智能增强诊断中的表现。由于这些 GAI 系统尚未针对医疗诊断进行调整，也未被批准用于临床，因此应谨慎解读这些研究结果，并将其主要用于研究目的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

JMIR Medical Informatics Medicine-Health Informatics

CiteScore

7.90

自引率

3.10%

发文量

173

审稿时长

12 weeks

期刊介绍： JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.