Mehmet Emin Gerek, Tuğba Önalan, Fatih Çölkesen, Şevket Arslan
Title: Evaluating large language models for WAO/EAACI guideline compliance in hereditary angioedema management
DOI: 10.15586/aei.v53i4.1353
Journal: Allergologia et immunopathologia, 53(4), 51-59, published 2025-07-01 (eCollection 2025)
Citations: 0
Abstract
Introduction: Hereditary angioedema (HAE) is a rare but potentially life-threatening disorder characterized by recurrent swelling episodes. Adherence to clinical guidelines, such as the World Allergy Organization/European Academy of Allergy & Clinical Immunology (WAO/EAACI) guidelines, is crucial for effective management. With the increasing role of artificial intelligence in medicine, large language models (LLMs) offer potential for clinical decision support. This study evaluates the performance of ChatGPT, Gemini, Perplexity, and Copilot in providing guideline-adherent responses for HAE management.
Methods: Twenty-eight key recommendations from the WAO/EAACI HAE guidelines were reformulated into interrogative formats and posed to the selected LLMs. Two independent clinicians assessed responses based on accuracy, adequacy, clarity, and citation reliability using a five-point Likert scale. References were categorized as guideline-based, trustworthy, or untrustworthy. A reevaluation with explicit citation instructions was conducted, with discrepancies resolved by a third reviewer.
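The rating protocol described above — two independent clinicians scoring each response on a five-point Likert scale, with a third reviewer resolving disagreements — can be sketched in a few lines. The scores and reviewer names below are illustrative assumptions, not data from the study:

```python
from statistics import median

# Hypothetical 1-5 Likert scores from two independent raters for five of a
# model's responses; the third reviewer is consulted only on discrepancies.
rater_a = [5, 4, 5, 3, 5]
rater_b = [5, 3, 5, 3, 4]
third_reviewer = [None, 4, None, None, 5]  # None where raters already agree

def consolidate(a, b, third):
    """Keep the agreed score; defer to the third reviewer on discrepancies."""
    return [s1 if s1 == s2 else s3 for s1, s2, s3 in zip(a, b, third)]

scores = consolidate(rater_a, rater_b, third_reviewer)
print(scores)          # per-question consolidated scores
print(median(scores))  # summary statistic, analogous to the study's medians
```

The study then summarizes each model's consolidated scores as medians, which is why results are reported as median accuracy and adequacy values.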
Results: ChatGPT and Gemini outperformed Perplexity and Copilot, achieving median accuracy and adequacy scores of 5.0, compared with 3.0 for the latter two models. ChatGPT had the lowest rate of unreliable references, whereas Gemini showed inconsistency in citation behavior. Significant differences in response quality were observed among models (p < 0.001). Providing explicit sourcing instructions improved performance consistency, particularly for Gemini.
Conclusion: ChatGPT and Gemini demonstrated superior adherence to WAO/EAACI guidelines, suggesting that LLMs can support clinical decision-making in rare diseases. However, inconsistencies in citation practices highlight the need for further validation and optimization to enhance reliability in medical applications.
Journal description:
Founded in 1972 by Professor A. Oehling, Allergologia et Immunopathologia is a forum for those working in the field of pediatric asthma, allergy and immunology. Manuscripts related to clinical, epidemiological and experimental allergy and immunopathology in childhood will be considered for publication. Allergologia et Immunopathologia is the official journal of the Spanish Society of Pediatric Allergy and Clinical Immunology (SEICAP) and of the Latin American Society of Immunodeficiencies (LASID). It has an independent international Editorial Committee, which submits received papers for peer review by international experts. The journal accepts original and review articles from all over the world, together with consensus statements from the aforementioned societies. Occasionally, the opinion of an expert on a current topic is published in the "Point of View" section. Letters to the Editor on previously published papers are welcome. Allergologia et Immunopathologia publishes 6 issues per year and is indexed in major databases such as PubMed, Scopus, and Web of Knowledge.