{"title":"Comparative evaluation of ChatGPT and LLaMA for reliability, quality, and accuracy in familial Mediterranean fever.","authors":"Aslıhan Uzun Bektaş, Balahan Bora, Erbil Ünsal","doi":"10.1007/s00431-025-06318-y","DOIUrl":null,"url":null,"abstract":"<p><p>Familial Mediterranean fever (FMF) is the most common monogenic autoinflammatory disease. Large language models (LLMs) offer rapid access to medical information. This study evaluated and compared the reliability, quality, and accuracy of ChatGPT-4o and LLaMA-3.1 for FMF. Single Hub and Access Point for Paediatric Rheumatology in Europe (SHARE) and European League Against Rheumatism (EULAR) guidelines were used for question generation and also for answer validation. Thirty-one questions were developed from a clinician's perspective based on the related guidelines. Two pediatric rheumatologists with over 20 years of FMF experience independently and blindly evaluated the responses. Reliability, quality, and accuracy were assessed using the modified DISCERN Scale, Global Quality Score, and the guidelines, respectively. Readability was assessed using multiple established indices. Statistical analyses included the Shapiro-Wilk test to assess normality, followed by paired t-tests for normally distributed scores, and Wilcoxon signed-rank tests for non-normally distributed scores. Both models demonstrated moderate reliability and high response quality. In terms of alignment with the guidelines, LLaMA provided fully aligned, complete, and accurate responses to 51.6% (16/31) of the questions, whereas ChatGPT provided such responses to 80.6% (25/31). While 9.7% (4/31) of LLaMA's responses were entirely contradictory to the guidelines, ChatGPT did not produce any such responses. ChatGPT outperformed LLaMA in terms of accuracy, quality, and reliability, with statistical significance. 
Readability assessments showed that both LLMs required college-level understanding.</p><p><strong>Conclusion: </strong>While LLMs show great promise, current limitations in accuracy and guideline adherence mean they should supplement, not replace, clinical expertise.</p><p><strong>Clinical trial registration: </strong>This study does not involve clinical trials; therefore, no clinical trial registration is required.</p><p><strong>What is known: </strong>•FMF is the most common hereditary autoinflammatory disease. LLMs are increasingly used to provide clinical information.</p><p><strong>What is new: </strong>•This is the first study to assess two different LLMs in the context of FMF, evaluating their reliability, quality, and alignment with clinical guidelines. ChatGPT-4o outperformed LLaMA-3.1 in reliability,quality, and guideline alignment for FMF. However, both models showed informational gaps that may limit their clinical use.</p>","PeriodicalId":11997,"journal":{"name":"European Journal of Pediatrics","volume":"184 8","pages":"491"},"PeriodicalIF":3.0000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Pediatrics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00431-025-06318-y","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PEDIATRICS","Score":null,"Total":0}
Citations: 0
Abstract
Familial Mediterranean fever (FMF) is the most common monogenic autoinflammatory disease. Large language models (LLMs) offer rapid access to medical information. This study evaluated and compared the reliability, quality, and accuracy of ChatGPT-4o and LLaMA-3.1 for FMF. Single Hub and Access Point for Paediatric Rheumatology in Europe (SHARE) and European League Against Rheumatism (EULAR) guidelines were used for question generation and also for answer validation. Thirty-one questions were developed from a clinician's perspective based on the related guidelines. Two pediatric rheumatologists with over 20 years of FMF experience independently and blindly evaluated the responses. Reliability, quality, and accuracy were assessed using the modified DISCERN Scale, Global Quality Score, and the guidelines, respectively. Readability was assessed using multiple established indices. Statistical analyses included the Shapiro-Wilk test to assess normality, followed by paired t-tests for normally distributed scores, and Wilcoxon signed-rank tests for non-normally distributed scores. Both models demonstrated moderate reliability and high response quality. In terms of alignment with the guidelines, LLaMA provided fully aligned, complete, and accurate responses to 51.6% (16/31) of the questions, whereas ChatGPT provided such responses to 80.6% (25/31). While 9.7% (4/31) of LLaMA's responses were entirely contradictory to the guidelines, ChatGPT did not produce any such responses. ChatGPT outperformed LLaMA in terms of accuracy, quality, and reliability, with statistical significance. Readability assessments showed that both LLMs required college-level understanding.
Conclusion: While LLMs show great promise, current limitations in accuracy and guideline adherence mean they should supplement, not replace, clinical expertise.
Clinical trial registration: This study does not involve clinical trials; therefore, no clinical trial registration is required.
What is known: •FMF is the most common hereditary autoinflammatory disease. LLMs are increasingly used to provide clinical information.
What is new: •This is the first study to assess two different LLMs in the context of FMF, evaluating their reliability, quality, and alignment with clinical guidelines. ChatGPT-4o outperformed LLaMA-3.1 in reliability, quality, and guideline alignment for FMF. However, both models showed informational gaps that may limit their clinical use.
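The statistical workflow described in the abstract — check paired score differences for normality with Shapiro-Wilk, then apply either a paired t-test or a Wilcoxon signed-rank test — can be sketched as follows. This is a minimal illustration of that decision logic, not the authors' code; the score arrays are invented placeholders, not data from the study.

```python
# Hypothetical sketch of the abstract's statistical pipeline: test the
# paired score differences for normality (Shapiro-Wilk), then choose a
# paired t-test (normal) or a Wilcoxon signed-rank test (non-normal).
from scipy import stats


def compare_paired_scores(a, b, alpha=0.05):
    """Return (test_name, p_value) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    _, p_norm = stats.shapiro(diffs)
    if p_norm > alpha:
        # Differences consistent with normality -> parametric test
        _, p = stats.ttest_rel(a, b)
        return "paired t-test", p
    # Otherwise fall back to the non-parametric alternative
    _, p = stats.wilcoxon(a, b)
    return "Wilcoxon signed-rank", p


# Made-up quality scores for ten questions, one pair per question
chatgpt_scores = [5, 4, 5, 4, 5, 5, 4, 5, 4, 5]
llama_scores = [4, 3, 4, 4, 3, 5, 3, 4, 3, 4]
test_name, p_value = compare_paired_scores(chatgpt_scores, llama_scores)
```

Each question contributes one score pair, so the tests operate on within-question differences rather than treating the two models' scores as independent samples.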
Journal description:
The European Journal of Pediatrics (EJPE) is a leading peer-reviewed medical journal which covers the entire field of pediatrics. The editors encourage authors to submit original articles, reviews, short communications, and correspondence on all relevant themes and topics.
EJPE is particularly committed to the publication of articles on important new clinical research that will have an immediate impact on clinical pediatric practice. The editorial office very much welcomes ideas for publications, whether individual articles or article series, that fit this goal and is always willing to address inquiries from authors regarding potential submissions. Invited review articles on clinical pediatrics that provide comprehensive coverage of a subject of importance are also regularly commissioned.
The short publication time reflects both the commitment of the editors and publishers and their passion for new developments in the field of pediatrics.
EJPE is active on social media (@EurJPediatrics) and we invite you to participate.
EJPE is the official journal of the European Academy of Paediatrics (EAP) and publishes guidelines and statements in cooperation with the EAP.