Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports.

Impact factor: 2.0; JCR quartile: Q3; Category: Health Care Sciences & Services

JMIR Formative Research. Pub Date: 2024-11-19. DOI: 10.2196/64844

Takanobu Hirosawa, Yukinori Harada, Kazuki Tokumasu, Tatsuya Shiraishi, Tomoharu Suzuki, Taro Shimizu

JMIR Formative Research, volume 8, e64844. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11615545/pdf/


Abstract

Background: Generative artificial intelligence (AI), particularly in the form of large language models, has developed rapidly. The LLaMA series is popular and was recently updated from LLaMA2 to LLaMA3. However, the impact of this update on diagnostic performance has not been well documented.

Objective: We compared the diagnostic performance of differential diagnosis lists generated by LLaMA3 and LLaMA2 for case reports.

Methods: We analyzed case reports published in the American Journal of Case Reports from 2022 to 2023. After excluding nondiagnostic and pediatric cases, we input the remaining cases into LLaMA3 and LLaMA2 using the same prompt and the same adjustable parameters. Diagnostic performance was defined by whether the differential diagnosis lists included the final diagnosis. Multiple physicians independently evaluated whether the final diagnosis was included in the top 10 differentials generated by LLaMA3 and LLaMA2.
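
The inclusion criterion described in the Methods can be illustrated with a minimal Python sketch. It assumes each physician judgment has already been encoded as the 1-based position of the final diagnosis in the model's list, or `None` when the list did not contain it. That encoding, the function name `top_k_rate`, and the example data are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence


def top_k_rate(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose final diagnosis appears within the top k.

    ranks: one entry per case; the 1-based position of the final diagnosis
    in the differential list, or None if it was not listed (assumed encoding).
    """
    if not ranks:
        raise ValueError("ranks must contain at least one case")
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Hypothetical judgments for eight cases, for illustration only.
example_ranks = [1, 3, None, 7, 2, None, 10, 1]

if __name__ == "__main__":
    for k in (1, 5, 10):
        print(f"top {k}: {top_k_rate(example_ranks, k):.2f}")
```

The same function covers the three cutoffs the abstract reports (top 10, top 5, and top diagnosis) by changing `k`.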

Results: We analyzed differential diagnosis lists for 392 case reports. The final diagnosis was included in the top 10 differentials generated by LLaMA3 in 79.6% (312/392) of cases, compared with 49.7% (195/392) for LLaMA2, a statistically significant improvement (P<.001). LLaMA3 also included the final diagnosis in the top 5 differentials more often: 63% (247/392) of cases versus 38% (149/392) for LLaMA2 (P<.001). The top diagnosis matched the final diagnosis in 33.9% (133/392) of cases for LLaMA3, significantly higher than the 22.7% (89/392) for LLaMA2 (P<.001). Diagnostic performance varied across medical specialties, with LLaMA3 consistently outperforming LLaMA2.
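
The percentages in the Results follow directly from the reported counts. A minimal Python sketch that recomputes them from the numbers in this abstract is shown below; the dictionary layout and the function name `inclusion_rate` are illustrative choices, not part of the study.

```python
# Counts reported in the abstract: cases (out of 392) in which the final
# diagnosis appeared within each cutoff of the differential diagnosis list.
TOTAL_CASES = 392
HITS = {
    "top 10": {"LLaMA3": 312, "LLaMA2": 195},
    "top 5": {"LLaMA3": 247, "LLaMA2": 149},
    "top 1": {"LLaMA3": 133, "LLaMA2": 89},
}


def inclusion_rate(hits: int, total: int = TOTAL_CASES) -> float:
    """Return the percentage of cases whose list included the final diagnosis."""
    return 100.0 * hits / total


if __name__ == "__main__":
    for cutoff, by_model in HITS.items():
        for model, hits in by_model.items():
            rate = inclusion_rate(hits)
            print(f"{cutoff:>6} {model}: {rate:.1f}% ({hits}/{TOTAL_CASES})")
```

The sketch only reproduces the proportions. The P values cannot be recomputed from the abstract, because it states neither the statistical test used nor the paired per-case outcomes.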

Conclusions: LLaMA3 significantly outperformed LLaMA2 in diagnostic performance, with a higher percentage of case reports having the final diagnosis listed within the top 10, within the top 5, and as the top diagnosis. Overall diagnostic performance improved almost 1.5 times from LLaMA2 to LLaMA3. These findings support the rapid development and continuous refinement of generative AI systems to enhance diagnostic processes in medicine. However, the findings should be interpreted carefully for clinical application, as generative AI, including the LLaMA series, has not been approved for medical applications such as AI-enhanced diagnostics.
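
The "almost 1.5 times" figure can be checked against the reported counts. The sketch below divides LLaMA3's hit count by LLaMA2's at each cutoff. The abstract does not say which cutoff the authors based the figure on, so the ratio is computed for all three; the names `COUNTS` and `improvement_ratio` are illustrative.

```python
# (LLaMA3 hits, LLaMA2 hits) out of 392 cases, as reported in the abstract.
COUNTS = {
    "top 10": (312, 195),
    "top 5": (247, 149),
    "top 1": (133, 89),
}


def improvement_ratio(llama3_hits: int, llama2_hits: int) -> float:
    """Ratio of LLaMA3 to LLaMA2 hit counts (the denominators are equal)."""
    return llama3_hits / llama2_hits


if __name__ == "__main__":
    for cutoff, (llama3_hits, llama2_hits) in COUNTS.items():
        ratio = improvement_ratio(llama3_hits, llama2_hits)
        print(f"{cutoff:>6}: {ratio:.2f}x")
```

With these counts the ratio is about 1.49 for the top diagnosis and about 1.60 and 1.66 for the top 10 and top 5 lists, respectively.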


