Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model.

IF 7.7

PLOS digital health; Pub Date: 2024-08-21; eCollection Date: 2024-08-01; DOI: 10.1371/journal.pdig.0000568

David Soong, Sriram Sridhar, Han Si, Jan-Samuel Wagner, Ana Caroline Costa Sá, Christina Y Yu, Kubra Karagoz, Meijian Guan, Sanyam Kumar, Hisham Hamadeh, Brandon W Higgs

PLOS digital health, volume 3, issue 8, e0000568. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11338460/pdf/

Citations: 0

Abstract

Large language models (LLMs) have made a significant impact on the field of artificial intelligence. General-purpose LLMs exhibit strong logic and reasoning skills and broad world knowledge, but they can generate misleading results when prompted on specific subject areas. LLMs trained with domain-specific knowledge can reduce the generation of misleading information (i.e., hallucinations) and improve precision in specialized contexts. Training new LLMs on specific corpora, however, can be resource intensive. Here we explored the use of a retrieval-augmented generation (RAG) model, which we tested on literature specific to a biomedical research area. OpenAI's GPT-3.5 and GPT-4, Microsoft's Prometheus, and a custom RAG model were used to answer 19 questions pertaining to diffuse large B-cell lymphoma (DLBCL) disease biology and treatment. Eight independent reviewers assessed the LLM responses for accuracy, relevance, and readability, rating each response on a 3-point scale in each category. These scores were then used to compare LLM performance, which varied across scoring categories. On accuracy and relevance, the RAG model outperformed the other models, with higher average scores and the largest number of top scores across questions. GPT-4 was closer to the RAG model on relevance than on accuracy. By the same measures, GPT-4 and GPT-3.5 had the highest readability scores of the LLMs compared. GPT-4 and GPT-3.5 also produced more answers with hallucinations than the other LLMs, owing to non-existent references and inaccurate responses to clinical questions. Our findings suggest that an oncology research-focused RAG model may outperform general-purpose LLMs in accuracy and relevance when answering subject-related questions. This framework can be tailored to Q&A in other subject areas. Further research will help clarify the impact of LLM architectures, RAG methodologies, and prompting techniques on answering questions across different subject areas.
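The comparison described in the abstract (average reviewer scores per model, plus a count of questions on which each model scored highest) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `summarize`, the data layout, and the toy ratings are hypothetical and do not come from the paper, and counting a tie as a top score for every tied model is an assumed convention, not the authors' stated method.

```python
from statistics import mean

def summarize(ratings):
    """Aggregate reviewer ratings for one scoring category.

    ratings[model][q] is the list of reviewer scores (1 to 3) that the
    given model received on question q. Hypothetical layout.
    """
    models = list(ratings)
    n_questions = len(next(iter(ratings.values())))
    # Mean reviewer score for each model on each question.
    per_q = {m: [mean(scores) for scores in ratings[m]] for m in models}
    # Average of the per-question means for each model.
    overall = {m: mean(per_q[m]) for m in models}
    # Number of questions on which each model has the highest mean score
    # (assumption: ties count as a top score for every tied model).
    top_counts = {m: 0 for m in models}
    for q in range(n_questions):
        best = max(per_q[m][q] for m in models)
        for m in models:
            if per_q[m][q] == best:
                top_counts[m] += 1
    return overall, top_counts

# Toy data: two models, two questions, three reviewers (not study data).
toy = {
    "rag": [[3, 3, 3], [3, 2, 3]],
    "gpt4": [[2, 3, 2], [3, 3, 3]],
}
overall, top_counts = summarize(toy)
print(overall)
print(top_counts)
```

In the study itself there would be four models, 19 questions, eight reviewers, and three categories; one call per category would yield the two summary measures the abstract mentions.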

