From Guidelines to Real-Time Conversation: Expert-Validated Retrieval-Augmented and Fine-Tuned GPT-4 for Hepatitis C Management

IF 5.2 | Zone 2, Medicine | JCR Q1, GASTROENTEROLOGY & HEPATOLOGY

Liver International | Pub Date: 2025-09-17 | DOI: 10.1111/liv.70349

Mauro Giuffrè, Nicola Pugliese, Simone Kresevic, Milos Ajcevic, Francesco Negro, Massimo Puoti, Xavier Forns, Jean-Michel Pawlotsky, Dennis L. Shung, Alessio Aghemo

Liver International, vol. 45, no. 10. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12442523/pdf/ | Publisher page: https://onlinelibrary.wiley.com/doi/10.1111/liv.70349

Citations: 0

Abstract

Background and Aims

Advances in artificial intelligence, particularly large language models (LLMs), hold promise for transforming chronic disease management such as Hepatitis C Virus (HCV) infection. This study evaluates the impact of retrieval-augmented generation (RAG) and supervised fine-tuning (SFT) on both open-ended question answering (accuracy and clarity) and on LLM-recommended treatment regimens for clinical scenarios.

Methods

We employed OpenAI's GPT-4 Turbo in four configurations (baseline, RAG-Top1, RAG-Top10 and SFT), using the 2020 EASL HCV guidelines as external knowledge or fine-tuning data. For the question set, guidelines were segmented at the paragraph level and encoded into 3072-dimensional embeddings. Fifteen questions covering general, patient and physician perspectives were scored by four experts on a 10-point accuracy scale and on binary accuracy and clarity. Separately, we created 25 simulated clinical scenarios; a consensus of four hepatologists defined the gold-standard direct-acting antiviral (DAA) regimens. Model performance on these cases was measured by two metrics: ‘partial accuracy’ (≥ one correct DAA without errors) and ‘complete accuracy’ (all correct DAAs without errors).
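The retrieval step described above can be illustrated with a minimal sketch. It assumes that guideline paragraphs are ranked by cosine similarity between the question embedding and each paragraph embedding, and that the top k paragraphs (k = 1 for RAG-Top1, k = 10 for RAG-Top10) are placed ahead of the question in the prompt. The abstract states neither the similarity measure nor the prompt layout, so both are assumptions. The function names, the placeholder paragraph texts and the toy 3-dimensional vectors (standing in for the 3072-dimensional embeddings) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, paragraphs, k):
    """Return the texts of the k paragraphs most similar to the query, best first.

    `paragraphs` is a list of (text, embedding) pairs. If k exceeds the number
    of paragraphs, all paragraphs are returned in ranked order.
    """
    ranked = sorted(
        paragraphs,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_paragraphs):
    """Assemble a prompt with the retrieved context first (assumed layout)."""
    context = "\n\n".join(context_paragraphs)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy index: placeholder texts with 3-dimensional vectors instead of real embeddings.
paragraphs = [
    ("Guideline paragraph A", [1.0, 0.0, 0.0]),
    ("Guideline paragraph B", [0.8, 0.6, 0.0]),
    ("Guideline paragraph C", [0.0, 1.0, 0.0]),
    ("Guideline paragraph D", [0.0, 0.0, 1.0]),
]
query_vec = [0.9, 0.1, 0.0]  # hypothetical embedding of a user question

top1 = retrieve_top_k(query_vec, paragraphs, k=1)    # RAG-Top1 analogue
top10 = retrieve_top_k(query_vec, paragraphs, k=10)  # RAG-Top10 analogue
prompt = build_prompt("Placeholder question", top1)

print(top1)
print(top10)
```

In this sketch the only difference between the two RAG configurations is the value of k: a larger k passes more guideline context to the model at the cost of a longer prompt.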

Results

On open-ended questions, RAG-Top10 outperformed baseline in accuracy (91.7% vs. 36.6%; p < 0.001) and clarity (91.7% vs. 46.6%; p < 0.001). RAG-Top1 achieved 81.7% accuracy and 86.6% clarity (both p < 0.001), while SFT reached 71.7% accuracy and 88.3% clarity (p < 0.001). Similarly, RAG-Top10 achieved the highest performance in prescribing the correct DAA regimen according to expert consensus in 76% of cases (vs. 24% for baseline model, p < 0.001).
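The two scenario metrics defined in the Methods can be sketched as set comparisons. This is one possible reading, not the authors' scoring procedure: 'without errors' is taken to mean that the model recommends no regimen outside the expert gold standard, 'partial accuracy' then requires at least one gold-standard regimen, and 'complete accuracy' requires exactly the gold-standard set. The regimen labels and the four toy cases are hypothetical placeholders, not study data.

```python
def score_case(recommended, gold):
    """Return (partial, complete) for one scenario.

    Assumed reading of the metrics:
      - no errors: every recommended regimen is in the gold-standard set
      - partial:   no errors and at least one recommended regimen
      - complete:  recommended set equals a non-empty gold-standard set
    """
    recommended, gold = set(recommended), set(gold)
    no_errors = recommended <= gold
    partial = no_errors and len(recommended) >= 1
    complete = len(gold) >= 1 and recommended == gold
    return partial, complete


def accuracy_rates(cases):
    """Fraction of cases meeting each metric; `cases` is a list of (recommended, gold)."""
    scores = [score_case(rec, gold) for rec, gold in cases]
    n = len(scores)
    partial_rate = sum(1 for p, _ in scores if p) / n
    complete_rate = sum(1 for _, c in scores if c) / n
    return partial_rate, complete_rate


# Hypothetical toy cases with placeholder regimen labels.
toy_cases = [
    ({"regimen_A", "regimen_B"}, {"regimen_A", "regimen_B"}),  # all correct
    ({"regimen_A"}, {"regimen_A", "regimen_B"}),               # one correct, none wrong
    ({"regimen_A", "regimen_C"}, {"regimen_A", "regimen_B"}),  # contains an error
    (set(), {"regimen_A"}),                                    # nothing recommended
]

partial_rate, complete_rate = accuracy_rates(toy_cases)
print(partial_rate, complete_rate)
```

Under this reading every completely accurate case is also partially accurate, so the complete-accuracy rate can never exceed the partial-accuracy rate.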

Conclusions

Both RAG-Top10 and SFT markedly enhance LLM performance in guideline-driven HCV management—improving not only response accuracy and clarity but also DAA selection in clinical scenarios. RAG-Top10's broader context retrieval confers the greatest gains, while SFT underscores the value of domain-specific alignment. Rigorous, expert-informed evaluation frameworks are essential for the safe integration of LLMs into clinical practice.

Source journal: Liver International (Medicine: Gastroenterology & Hepatology)

CiteScore: 13.90

Self-citation rate: 4.50%

Articles published: 348

Review time: 2 months

About the journal: Liver International promotes all aspects of the science of hepatology from basic research to applied clinical studies. Providing an international forum for the publication of high-quality original research in hepatology, it is an essential resource for everyone working on normal and abnormal structure and function in the liver and its constituent cells, including clinicians and basic scientists involved in the multi-disciplinary field of hepatology. The journal welcomes articles from all fields of hepatology, which may be published as original articles, brief definitive reports, reviews, mini-reviews, images in hepatology and letters to the Editor.