From Guidelines to Real-Time Conversation: Expert-Validated Retrieval-Augmented and Fine-Tuned GPT-4 for Hepatitis C Management
Mauro Giuffrè, Nicola Pugliese, Simone Kresevic, Milos Ajcevic, Francesco Negro, Massimo Puoti, Xavier Forns, Jean-Michel Pawlotsky, Dennis L. Shung, Alessio Aghemo
Liver International, 45 (10). Published 2025-09-17. DOI: 10.1111/liv.70349
https://onlinelibrary.wiley.com/doi/10.1111/liv.70349
Abstract
Background and Aims
Advances in artificial intelligence, particularly large language models (LLMs), hold promise for transforming the management of chronic diseases such as hepatitis C virus (HCV) infection. This study evaluates the impact of retrieval-augmented generation (RAG) and supervised fine-tuning (SFT) on open-ended question answering (accuracy and clarity) and on LLM-recommended treatment regimens for clinical scenarios.
Methods
We used OpenAI's GPT-4 Turbo in four configurations (baseline, RAG-Top1, RAG-Top10 and SFT), with the 2020 EASL HCV guidelines serving as external knowledge or fine-tuning data. For the question set, the guidelines were segmented at the paragraph level and encoded into 3072-dimensional embeddings. Four experts scored fifteen questions covering general, patient and physician perspectives on a 10-point accuracy scale and on binary accuracy and clarity. Separately, we created 25 simulated clinical scenarios; a consensus of four hepatologists defined the gold-standard direct-acting antiviral (DAA) regimens. Model performance on these cases was measured by two metrics: ‘partial accuracy’ (≥ one correct DAA without errors) and ‘complete accuracy’ (all correct DAAs without errors).
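The retrieval step described above can be illustrated with a minimal sketch. It assumes that "Top1" and "Top10" refer to the number of guideline paragraphs retrieved per question and that paragraphs are ranked by cosine similarity; the abstract states neither explicitly. The function names, the random stand-in vectors and the placeholder paragraph texts are hypothetical, and no embedding API is called.

```python
import numpy as np

EMBED_DIM = 3072  # embedding size stated in the abstract


def top_k_paragraphs(query_vec, paragraph_vecs, k):
    """Return (indices, scores) of the k paragraphs closest to the query.

    Cosine similarity is an assumption; the abstract does not name the metric.
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = paragraph_vecs / np.linalg.norm(paragraph_vecs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per paragraph
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order.tolist(), scores[order].tolist()


def build_context(paragraphs, indices):
    """Join the retrieved paragraphs into one context string for the prompt."""
    return "\n\n".join(paragraphs[i] for i in indices)


# Stand-in data: random vectors replace real guideline embeddings.
rng = np.random.default_rng(0)
paragraphs = [f"guideline paragraph {i}" for i in range(50)]
paragraph_vecs = rng.normal(size=(50, EMBED_DIM))
# A query that lies close to paragraph 7 by construction.
query_vec = paragraph_vecs[7] + 0.1 * rng.normal(size=EMBED_DIM)

top1, _ = top_k_paragraphs(query_vec, paragraph_vecs, k=1)
top10, _ = top_k_paragraphs(query_vec, paragraph_vecs, k=10)
context_top1 = build_context(paragraphs, top1)
context_top10 = build_context(paragraphs, top10)
```

Under these assumptions, the two RAG configurations differ only in `k`: the Top10 context contains the Top1 paragraph plus nine lower-ranked ones, which is one way to read the "broader context retrieval" mentioned in the conclusions.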
Results
On open-ended questions, RAG-Top10 outperformed baseline in accuracy (91.7% vs. 36.6%; p < 0.001) and clarity (91.7% vs. 46.6%; p < 0.001). RAG-Top1 achieved 81.7% accuracy and 86.6% clarity (both p < 0.001), while SFT reached 71.7% accuracy and 88.3% clarity (p < 0.001). On the clinical scenarios, RAG-Top10 also performed best, prescribing the DAA regimen that matched expert consensus in 76% of cases (vs. 24% for the baseline model; p < 0.001).
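As a rough illustration of how the scenario metrics could be computed, the sketch below scores a set of recommended regimens against a gold-standard set. It rests on one possible reading of the definitions: "without errors" is taken to mean that every recommended regimen appears in the expert set. The function names and the placeholder regimen labels are hypothetical and carry no clinical meaning. The last lines convert the reported 76% and 24% into counts out of the 25 scenarios.

```python
def score_regimens(recommended, gold):
    """Return (partial, complete) for one scenario.

    Interpretation (assumed): 'without errors' means every recommended
    regimen appears in the expert gold-standard set.
    """
    recommended, gold = set(recommended), set(gold)
    no_errors = recommended <= gold
    partial = no_errors and len(recommended) >= 1   # >= one correct, no errors
    complete = partial and recommended == gold      # all correct, no errors
    return partial, complete


def accuracy_rates(cases):
    """cases: list of (recommended, gold) pairs -> (partial %, complete %)."""
    results = [score_regimens(rec, gold) for rec, gold in cases]
    n = len(results)
    partial_pct = 100 * sum(p for p, _ in results) / n
    complete_pct = 100 * sum(c for _, c in results) / n
    return partial_pct, complete_pct


# Hypothetical scenarios with placeholder regimen labels (not study data).
cases = [
    ({"regimen_A", "regimen_B"}, {"regimen_A", "regimen_B"}),  # complete
    ({"regimen_A"}, {"regimen_A", "regimen_B"}),               # partial only
    ({"regimen_A", "regimen_C"}, {"regimen_A", "regimen_B"}),  # contains an error
    (set(), {"regimen_A"}),                                    # nothing recommended
]
partial_pct, complete_pct = accuracy_rates(cases)

# Reported percentages expressed as counts out of the 25 scenarios.
N_SCENARIOS = 25
rag_top10_correct = round(0.76 * N_SCENARIOS)
baseline_correct = round(0.24 * N_SCENARIOS)
```

With 25 scenarios, 76% corresponds to 19 cases and 24% to 6 cases. The abstract does not say whether these figures refer to partial or complete accuracy.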
Conclusions
Both RAG-Top10 and SFT markedly enhance LLM performance in guideline-driven HCV management, improving not only response accuracy and clarity but also DAA selection in clinical scenarios. RAG-Top10's broader context retrieval confers the greatest gains, while the SFT results underscore the value of domain-specific alignment. Rigorous, expert-informed evaluation frameworks are essential for the safe integration of LLMs into clinical practice.
About the journal:
Liver International promotes all aspects of the science of hepatology from basic research to applied clinical studies. Providing an international forum for the publication of high-quality original research in hepatology, it is an essential resource for everyone working on normal and abnormal structure and function in the liver and its constituent cells, including clinicians and basic scientists involved in the multi-disciplinary field of hepatology. The journal welcomes articles from all fields of hepatology, which may be published as original articles, brief definitive reports, reviews, mini-reviews, images in hepatology and letters to the Editor.