Development of a liver disease-specific large language model chat interface using retrieval-augmented generation.

IF 12.9 | Zone 1 (Medicine) | JCR Q1, GASTROENTEROLOGY & HEPATOLOGY

Hepatology. Pub Date: 2024-11-01. Epub Date: 2024-03-07. DOI: 10.1097/HEP.0000000000000834

Jin Ge, Steve Sun, Joseph Owens, Victor Galvez, Oksana Gologorskaya, Jennifer C Lai, Mark J Pletcher, Ki Lai

Journal: Hepatology, pages 1158-1168. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11706764/pdf/

Citations: 0

Abstract

Background and aims: Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations.
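The retrieval step that RAG adds in front of an LLM can be illustrated with a minimal sketch. This is not the Versa or LiVersa implementation, which the abstract does not describe at this level. The bag-of-words "embedding", the function names (`embed`, `retrieve`, `build_prompt`), and the placeholder passages are all assumptions made for illustration. A real system would use a learned text-embedding model and a vector store.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy embedding: a sparse word-count vector (assumed stand-in for a real embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, top_k=2):
    """Return the top_k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages, top_k=2):
    """Assemble the text sent to the LLM: retrieved context followed by the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical placeholder passages, invented for illustration;
# they are not text from any guidance document.
PASSAGES = [
    "Placeholder passage about screening intervals for liver cancer.",
    "Placeholder passage about vaccination schedules for hepatitis B.",
    "Placeholder passage about nutrition in cirrhosis.",
]

if __name__ == "__main__":
    print(build_prompt("How often is screening for liver cancer done?", PASSAGES, top_k=1))
```

The point of the sketch is the division of labor: the model's weights are untouched, and specialization comes from which passages are retrieved and placed in the prompt. That is also why the size of the document collection bounds what such a system can answer.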

Approach and results: We developed "LiVersa," a liver disease-specific LLM, by using our institution's protected health information-compliant text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases guidance documents to be incorporated into LiVersa. We evaluated LiVersa's performance by conducting 2 rounds of testing. First, we compared LiVersa's outputs versus those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2 (LLaMA 2) in response to 10 hepatology topic questions. LiVersa's outputs were more accurate but were rated less comprehensive and less safe than those of ChatGPT 4.

Conclusions: In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.

Source journal: Hepatology (Medicine: Gastroenterology & Hepatology)

CiteScore: 27.50

Self-citation rate: 3.70%

Articles published: 609

Review time: 1 month

About the journal: HEPATOLOGY is recognized as the leading publication in the field of liver disease. It features original, peer-reviewed articles covering various aspects of liver structure, function, and disease. The journal's distinguished Editorial Board carefully selects the best articles each month, focusing on topics including immunology, chronic hepatitis, viral hepatitis, cirrhosis, genetic and metabolic liver diseases, liver cancer, and drug metabolism.