Comparative evaluation of ChatGPT and LLaMA for reliability, quality, and accuracy in familial Mediterranean fever.

Impact Factor: 3.0 · CAS Region 3 (Medicine) · JCR Q1 (Pediatrics)
Aslıhan Uzun Bektaş, Balahan Bora, Erbil Ünsal
European Journal of Pediatrics, vol. 184, no. 8, article 491. Published 2025-07-18. DOI: 10.1007/s00431-025-06318-y (https://doi.org/10.1007/s00431-025-06318-y)
Citations: 0

Abstract

Familial Mediterranean fever (FMF) is the most common monogenic autoinflammatory disease. Large language models (LLMs) offer rapid access to medical information. This study evaluated and compared the reliability, quality, and accuracy of ChatGPT-4o and LLaMA-3.1 for FMF. Single Hub and Access Point for Paediatric Rheumatology in Europe (SHARE) and European League Against Rheumatism (EULAR) guidelines were used for question generation and also for answer validation. Thirty-one questions were developed from a clinician's perspective based on the related guidelines. Two pediatric rheumatologists with over 20 years of FMF experience independently and blindly evaluated the responses. Reliability, quality, and accuracy were assessed using the modified DISCERN Scale, Global Quality Score, and the guidelines, respectively. Readability was assessed using multiple established indices. Statistical analyses included the Shapiro-Wilk test to assess normality, followed by paired t-tests for normally distributed scores, and Wilcoxon signed-rank tests for non-normally distributed scores. Both models demonstrated moderate reliability and high response quality. In terms of alignment with the guidelines, LLaMA provided fully aligned, complete, and accurate responses to 51.6% (16/31) of the questions, whereas ChatGPT provided such responses to 80.6% (25/31). While 9.7% (4/31) of LLaMA's responses were entirely contradictory to the guidelines, ChatGPT did not produce any such responses. ChatGPT outperformed LLaMA in terms of accuracy, quality, and reliability, with statistical significance. Readability assessments showed that both LLMs required college-level understanding.
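The statistical pipeline described above (Shapiro-Wilk for normality, then a paired t-test for normally distributed scores or a Wilcoxon signed-rank test otherwise) can be sketched in Python with SciPy. The per-question scores below are hypothetical placeholders for illustration, not the study's data:

```python
# Illustrative sketch of the abstract's statistical pipeline: test the paired
# differences for normality, then pick the paired test accordingly.
from scipy import stats

# Hypothetical per-question quality scores for the two models
# (the study used n = 31 questions; 10 shown here for brevity).
chatgpt = [4.5, 4.0, 5.0, 4.5, 4.0, 5.0, 4.5, 4.0, 4.5, 5.0]
llama = [4.0, 3.5, 4.5, 4.0, 3.5, 4.0, 4.0, 3.0, 4.0, 4.5]
diffs = [a - b for a, b in zip(chatgpt, llama)]

# Shapiro-Wilk assesses normality of the paired differences.
_, p_normal = stats.shapiro(diffs)

if p_normal > 0.05:
    # Differences look normally distributed: paired t-test.
    statistic, p = stats.ttest_rel(chatgpt, llama)
    test_name = "paired t-test"
else:
    # Non-normal differences: Wilcoxon signed-rank test.
    statistic, p = stats.wilcoxon(chatgpt, llama)
    test_name = "Wilcoxon signed-rank"

print(f"{test_name}: statistic={statistic:.3f}, p={p:.4f}")
```

With blinded rater scores on an ordinal scale, the non-parametric branch is often the one that fires in practice, which is why the normality check comes first.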

Conclusion: While LLMs show great promise, current limitations in accuracy and guideline adherence mean they should supplement, not replace, clinical expertise.
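The abstract reports that both models' responses demanded college-level reading but does not name the readability indices used. One widely used example, the Flesch-Kincaid grade level, can be computed with a short self-contained sketch; the vowel-group syllable counter is a crude heuristic, and the sample text is taken from the abstract itself:

```python
# Flesch-Kincaid grade level:
#   0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
# Higher values roughly correspond to more years of schooling required.
import re


def count_syllables(word: str) -> int:
    # Count runs of consecutive vowels as syllables (rough heuristic).
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)


sample = ("Familial Mediterranean fever is the most common monogenic "
          "autoinflammatory disease. Large language models offer rapid "
          "access to medical information.")
grade = flesch_kincaid_grade(sample)
print(f"Flesch-Kincaid grade: {grade:.1f}")
```

Even this two-sentence excerpt scores well above a typical patient-education target (grade 6-8), which illustrates why dense clinical vocabulary pushes LLM answers toward college-level scores.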

Clinical trial registration: This study does not involve clinical trials; therefore, no clinical trial registration is required.

What is known: •FMF is the most common hereditary autoinflammatory disease. LLMs are increasingly used to provide clinical information.

What is new: •This is the first study to assess two different LLMs in the context of FMF, evaluating their reliability, quality, and alignment with clinical guidelines. ChatGPT-4o outperformed LLaMA-3.1 in reliability, quality, and guideline alignment for FMF. However, both models showed informational gaps that may limit their clinical use.

Source journal metrics: CiteScore 5.90 · Self-citation rate 2.80% · Annual articles: 367 · Review time: 3-6 weeks
About the journal: The European Journal of Pediatrics (EJPE) is a leading peer-reviewed medical journal which covers the entire field of pediatrics. The editors encourage authors to submit original articles, reviews, short communications, and correspondence on all relevant themes and topics. EJPE is particularly committed to the publication of articles on important new clinical research that will have an immediate impact on clinical pediatric practice. The editorial office very much welcomes ideas for publications, whether individual articles or article series, that fit this goal and is always willing to address inquiries from authors regarding potential submissions. Invited review articles on clinical pediatrics that provide comprehensive coverage of a subject of importance are also regularly commissioned. The short publication time reflects both the commitment of the editors and publishers and their passion for new developments in the field of pediatrics. EJPE is active on social media (@EurJPediatrics) and we invite you to participate. EJPE is the official journal of the European Academy of Paediatrics (EAP) and publishes guidelines and statements in cooperation with the EAP.