EVIDENCE-BASED DIGITAL SUPPORT IN HEPATOLOGY: RETRIEVAL-AUGMENTED GENERATION'S ROLE IN AUTOIMMUNE LIVER DISEASES MANAGEMENT
Ezequiel Ridruejo, Ernesto Saenz, Jimmy Daza, Heike Bantel, Marcos Girala, Matthias Ebert, Florian Van Bommel, Andreas Geier, Andres Gomez Aldana, Mario Reis Alvares-da-Silva, Markus Peck-Radosavljevic, Frank Tacke, Arndt Weinmann, Juan Turnes, Javier Pazo, Andreas Teufel
Annals of Hepatology, vol. 30, Article 101957. Published 2025-09-01. DOI: 10.1016/j.aohep.2025.101957
https://www.sciencedirect.com/science/article/pii/S1665268125001826
Abstract
Introduction and Objectives
Autoimmune liver diseases (AILDs) present significant diagnostic and management challenges. Following our initial evaluation of large language models (LLMs), we developed and assessed three specialized retrieval-augmented generation (RAG) systems. These systems incorporated comprehensive clinical guidelines and medication safety information to improve the accuracy of decision support. Our aim was to evaluate the effectiveness of retrieval-augmented AI systems in providing evidence-based recommendations for AILD management.
Materials and Methods
We engineered three distinct RAG systems: HepaChat, RAG-ChatGPT, and RAG-Claude. Each system integrated 13 international clinical guidelines covering the management of autoimmune hepatitis (AIH), primary biliary cholangitis (PBC), and primary sclerosing cholangitis (PSC). Additionally, we incorporated a database containing 12,465 FDA medication warnings to support adherence to safety protocols. Ten liver specialists (six European, four American) evaluated system responses to 56 standardized clinical questions using a 1-10 Likert scale. Questions addressed disease comprehension, therapeutic approaches, and clinical decision-making across all three major AILDs.
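The abstract does not describe how the three systems retrieve or rank source material, so the following is only a minimal sketch of the general retrieval-augmented pattern, not the authors' implementation. It assumes a toy bag-of-words cosine similarity in place of whatever retriever the real systems use; the function names (`retrieve`, `build_prompt`) and the placeholder passages are hypothetical and carry no clinical content.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return up to k passages with non-zero similarity to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p["text"]))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question, retrieved):
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Placeholder corpus: stands in for guideline excerpts and drug warnings.
PASSAGES = [
    {"source": "guideline-AIH (placeholder)",
     "text": "Placeholder passage on first-line therapy in autoimmune hepatitis."},
    {"source": "guideline-PBC (placeholder)",
     "text": "Placeholder passage on treatment response criteria in primary biliary cholangitis."},
    {"source": "drug-warning (placeholder)",
     "text": "Placeholder passage on a medication safety warning."},
]

QUESTION = "What is first-line therapy for autoimmune hepatitis?"
HITS = retrieve(QUESTION, PASSAGES)
PROMPT = build_prompt(QUESTION, HITS)
```

The assembled `PROMPT` would then be passed to a language model. Restricting the answer to retrieved context is the design choice that lets responses be traced back to a guideline or warning entry.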
Results
HepaChat achieved the highest mean score (7.58±1.48) and 33 best-rated responses, compared with RAG-Claude (7.22±1.58, 12 best-rated) and RAG-ChatGPT (7.21±1.67, 9 best-rated). Geographic stratification revealed differences in rating patterns (Americas: 7.97 vs Europe: 6.40). In disease-specific analysis, HepaChat's strongest results among European raters were in AIH (7.12) and PSC (6.89), and among raters from the Americas in AIH (8.17) and PBC (8.37). All three systems showed marked improvement over conventional LLMs (2023 benchmark: 6.72±1.67).
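The two summary statistics reported here (mean±SD per system and a count of best-rated responses) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the abstract does not say whether the SD is a sample or population value or how "best-rated" handles ties, so the sketch uses the sample SD and credits a question to a system only when it has the single highest mean rating. The function name `summarize` and the toy ratings are made up and are not study data.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(ratings):
    """Summarize (system, question_id, score) tuples.

    Returns (summary, best) where summary maps system -> (mean, sample SD)
    and best maps system -> number of questions on which it alone had the
    highest mean rating. Tie handling is an assumption of this sketch.
    """
    by_system = defaultdict(list)
    by_question = defaultdict(lambda: defaultdict(list))
    for system, question, score in ratings:
        by_system[system].append(score)
        by_question[question][system].append(score)

    summary = {
        s: (mean(v), stdev(v) if len(v) > 1 else 0.0)
        for s, v in by_system.items()
    }

    best = defaultdict(int)
    for per_system in by_question.values():
        means = {s: mean(v) for s, v in per_system.items()}
        top = max(means.values())
        winners = [s for s, m in means.items() if m == top]
        if len(winners) == 1:  # tied questions are credited to no system
            best[winners[0]] += 1
    return summary, dict(best)


# Toy ratings for two hypothetical systems "A" and "B" (not study data).
TOY = [
    ("A", "q1", 8), ("A", "q1", 6),
    ("B", "q1", 5), ("B", "q1", 7),
    ("A", "q2", 7), ("B", "q2", 7),
]
SUMMARY, BEST = summarize(TOY)
```

In the toy data, system "A" wins question `q1` and `q2` is a tie, so only one best-rated response is credited.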
Conclusions
This evaluation demonstrates that specialized RAG systems incorporating clinical guidelines and safety protocols can significantly enhance AILD management support. Geographic variations in assessment highlight the importance of considering regional clinical perspectives in AI system development.
About the journal:
Annals of Hepatology publishes original research on the biology and diseases of the liver in both humans and experimental models. Contributions may be submitted as regular articles. The journal also publishes concise reviews of both basic and clinical topics.