Vijaya Parameswaran, Jenna Bernard, Alec Bernard, Neil Deo, Sean Tsung, Kalle Lyytinen, Christopher Sharp, Fatima Rodriguez, David J Maron, Rajesh Dash
{"title":"评估大语言模型和检索增强生成增强为心血管疾病预防提供指南遵循的营养信息:横断面研究","authors":"Vijaya Parameswaran, Jenna Bernard, Alec Bernard, Neil Deo, Sean Tsung, Kalle Lyytinen, Christopher Sharp, Fatima Rodriguez, David J Maron, Rajesh Dash","doi":"10.2196/78625","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Cardiovascular disease (CVD) remains the leading cause of death worldwide, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness for providing nutrition information remains understudied. Although retrieval-augmented generation (RAG) frameworks have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.</p><p><strong>Objective: </strong>To evaluate the effectiveness of off-the-shelf and RAG-enhanced LLMs in delivering guideline-adherent nutrition information for CVD prevention, we assessed 3 off-the-shelf models (ChatGPT-4o, Perplexity, and Llama 3-70B) and a Llama 3-70B+RAG model.</p><p><strong>Methods: </strong>We curated 30 nutrition questions that comprehensively addressed CVD prevention. These were approved by a registered dietitian providing preventive cardiology services at an academic medical center and were posed 3 times to each model. We developed a 15,074-word knowledge bank incorporating the American Heart Association's 2021 dietary guidelines and related website content to enhance Meta's Llama 3-70B model using RAG. The model received this and a few-shot prompt as context, included citations in a Context Source section, and used vector similarity to align responses with guideline content, with the temperature parameter set to 0.5 to enhance consistency. Model responses were evaluated by 3 expert reviewers against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using ANOVA, with statistical significance set at P<.05. Interrater agreement was measured using the Cohen κ coefficient, and readability was estimated using the Flesch-Kincaid readability score.</p><p><strong>Results: </strong>The Llama 3+RAG model scored higher than the Perplexity, GPT-4o, and Llama 3 models on reliability, appropriateness, guideline adherence, and readability and showed no harm. The Cohen κ coefficient (κ>70%; P<.001) indicated high reviewer agreement.</p><p><strong>Conclusions: </strong>The Llama 3+RAG model outperformed the off-the-shelf models across all measures with no evidence of harm, although the responses were less readable due to technical language. The off-the-shelf models scored lower on all measures and produced some harmful responses. 
These findings highlight the limitations of off-the-shelf models and demonstrate that RAG system integration can enhance LLM performance in delivering evidence-based dietary information.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e78625"},"PeriodicalIF":6.0000,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating Large Language Models and Retrieval-Augmented Generation Enhancement for Delivering Guideline-Adherent Nutrition Information for Cardiovascular Disease Prevention: Cross-Sectional Study.\",\"authors\":\"Vijaya Parameswaran, Jenna Bernard, Alec Bernard, Neil Deo, Sean Tsung, Kalle Lyytinen, Christopher Sharp, Fatima Rodriguez, David J Maron, Rajesh Dash\",\"doi\":\"10.2196/78625\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Cardiovascular disease (CVD) remains the leading cause of death worldwide, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness for providing nutrition information remains understudied. Although retrieval-augmented generation (RAG) frameworks have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.</p><p><strong>Objective: </strong>To evaluate the effectiveness of off-the-shelf and RAG-enhanced LLMs in delivering guideline-adherent nutrition information for CVD prevention, we assessed 3 off-the-shelf models (ChatGPT-4o, Perplexity, and Llama 3-70B) and a Llama 3-70B+RAG model.</p><p><strong>Methods: </strong>We curated 30 nutrition questions that comprehensively addressed CVD prevention. These were approved by a registered dietitian providing preventive cardiology services at an academic medical center and were posed 3 times to each model. We developed a 15,074-word knowledge bank incorporating the American Heart Association's 2021 dietary guidelines and related website content to enhance Meta's Llama 3-70B model using RAG. The model received this and a few-shot prompt as context, included citations in a Context Source section, and used vector similarity to align responses with guideline content, with the temperature parameter set to 0.5 to enhance consistency. Model responses were evaluated by 3 expert reviewers against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using ANOVA, with statistical significance set at P<.05. Interrater agreement was measured using the Cohen κ coefficient, and readability was estimated using the Flesch-Kincaid readability score.</p><p><strong>Results: </strong>The Llama 3+RAG model scored higher than the Perplexity, GPT-4o, and Llama 3 models on reliability, appropriateness, guideline adherence, and readability and showed no harm. 
The Cohen κ coefficient (κ>70%; P<.001) indicated high reviewer agreement.</p><p><strong>Conclusions: </strong>The Llama 3+RAG model outperformed the off-the-shelf models across all measures with no evidence of harm, although the responses were less readable due to technical language. The off-the-shelf models scored lower on all measures and produced some harmful responses. These findings highlight the limitations of off-the-shelf models and demonstrate that RAG system integration can enhance LLM performance in delivering evidence-based dietary information.</p>\",\"PeriodicalId\":16337,\"journal\":{\"name\":\"Journal of Medical Internet Research\",\"volume\":\"27 \",\"pages\":\"e78625\"},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2025-10-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Medical Internet Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/78625\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/78625","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Evaluating Large Language Models and Retrieval-Augmented Generation Enhancement for Delivering Guideline-Adherent Nutrition Information for Cardiovascular Disease Prevention: Cross-Sectional Study.
Background: Cardiovascular disease (CVD) remains the leading cause of death worldwide, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness for providing nutrition information remains understudied. Although retrieval-augmented generation (RAG) frameworks have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.
Objective: To evaluate the effectiveness of off-the-shelf and RAG-enhanced LLMs in delivering guideline-adherent nutrition information for CVD prevention, we assessed 3 off-the-shelf models (ChatGPT-4o, Perplexity, and Llama 3-70B) and a Llama 3-70B+RAG model.
Methods: We curated 30 nutrition questions that comprehensively addressed CVD prevention. These were approved by a registered dietitian providing preventive cardiology services at an academic medical center and were posed 3 times to each model. We developed a 15,074-word knowledge bank incorporating the American Heart Association's 2021 dietary guidelines and related website content to enhance Meta's Llama 3-70B model using RAG. The model received this knowledge bank and a few-shot prompt as context, included citations in a Context Source section, and used vector similarity to align responses with guideline content, with the temperature parameter set to 0.5 to enhance consistency. Model responses were evaluated by 3 expert reviewers against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using ANOVA, with statistical significance set at P<.05. Interrater agreement was measured using the Cohen κ coefficient, and readability was estimated using the Flesch-Kincaid readability score.
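The abstract does not report implementation details beyond those above, but the retrieval step it describes (embedding chunks of the guideline knowledge bank, selecting the most similar chunks by vector similarity, and composing them with a few-shot prompt as context) can be sketched roughly as follows. This is a minimal illustration only: the embedding model, chunk size, prompt wording, and function names are assumptions, not the authors' configuration.

```python
# Minimal sketch of a guideline-grounded RAG step, under the assumptions noted above.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of encoder


def chunk(text: str, size: int = 200) -> list[str]:
    """Split the knowledge bank into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question by cosine similarity."""
    q = encoder.encode([question], normalize_embeddings=True)
    c = encoder.encode(chunks, normalize_embeddings=True)
    scores = (c @ q.T).ravel()  # cosine similarity, since embeddings are normalized
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]


def build_prompt(question: str, context_chunks: list[str], few_shot: str) -> str:
    """Compose the few-shot examples and retrieved guideline text as context."""
    context = "\n\n".join(context_chunks)
    return (
        f"{few_shot}\n\n"
        f"Context Source:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer using only the context above and cite the Context Source."
    )

# The composed prompt would then be sent to Llama 3-70B with temperature=0.5,
# as reported, through whatever inference endpoint is in use.
```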
Results: The Llama 3+RAG model scored higher than the Perplexity, GPT-4o, and Llama 3 models on reliability, appropriateness, guideline adherence, and readability and showed no harm. The Cohen κ coefficient (κ>70%; P<.001) indicated high reviewer agreement.
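As a point of reference for the statistics reported above, the sketch below shows how analyses of this kind are commonly computed in Python: a one-way ANOVA across model score distributions, Cohen κ for interrater agreement, and a Flesch-Kincaid readability estimate. The score values are placeholders for illustration, not the study's data, and the library choices are assumptions rather than the authors' tooling.

```python
# Illustrative evaluation statistics with placeholder data (not study results).
from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score
import textstat

# Hypothetical per-response reliability scores for three models.
gpt4o = [3.2, 3.5, 3.1, 3.4, 3.3]
llama3 = [3.0, 3.3, 3.2, 3.1, 3.0]
llama3_rag = [4.6, 4.7, 4.5, 4.8, 4.6]

f_stat, p_value = f_oneway(gpt4o, llama3, llama3_rag)
print(f"ANOVA: F={f_stat:.2f}, P={p_value:.4f}")  # significant if P < .05

# Agreement between two reviewers' categorical ratings of the same responses.
reviewer_a = ["appropriate", "appropriate", "inappropriate", "appropriate", "appropriate"]
reviewer_b = ["appropriate", "appropriate", "inappropriate", "inappropriate", "appropriate"]
print("Cohen kappa:", cohen_kappa_score(reviewer_a, reviewer_b))

# Readability of a sample model response (grade-level variant shown here;
# the study may instead use the Flesch reading-ease score).
response = "Eat more vegetables, whole grains, and legumes to help lower LDL cholesterol."
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(response))
```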
Conclusions: The Llama 3+RAG model outperformed the off-the-shelf models across all measures with no evidence of harm, although the responses were less readable due to technical language. The off-the-shelf models scored lower on all measures and produced some harmful responses. These findings highlight the limitations of off-the-shelf models and demonstrate that RAG system integration can enhance LLM performance in delivering evidence-based dietary information.
About the journal:
The Journal of Medical Internet Research (JMIR) is a highly respected publication in health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for more than two decades.
The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research, and is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR is ranked #1 on Google Scholar in the "Medical Informatics" discipline.