Vijaya Parameswaran, Jenna Bernard, Alec Bernard, Neil Deo, Sean Tsung, Kalle Lyytinen, Christopher Sharp, Fatima Rodriguez, David J Maron, Rajesh Dash
{"title":"评估大语言模型和检索增强生成增强为心血管疾病预防提供指南遵循的营养信息:横断面研究","authors":"Vijaya Parameswaran, Jenna Bernard, Alec Bernard, Neil Deo, Sean Tsung, Kalle Lyytinen, Christopher Sharp, Fatima Rodriguez, David J Maron, Rajesh Dash","doi":"10.2196/78625","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Cardiovascular disease (CVD) remains the leading cause of death worldwide, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness for providing nutrition information remains understudied. Although retrieval-augmented generation (RAG) frameworks have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.</p><p><strong>Objective: </strong>To evaluate the effectiveness of off-the-shelf and RAG-enhanced LLMs in delivering guideline-adherent nutrition information for CVD prevention, we assessed 3 off-the-shelf models (ChatGPT-4o, Perplexity, and Llama 3-70B) and a Llama 3-70B+RAG model.</p><p><strong>Methods: </strong>We curated 30 nutrition questions that comprehensively addressed CVD prevention. These were approved by a registered dietitian providing preventive cardiology services at an academic medical center and were posed 3 times to each model. We developed a 15,074-word knowledge bank incorporating the American Heart Association's 2021 dietary guidelines and related website content to enhance Meta's Llama 3-70B model using RAG. The model received this and a few-shot prompt as context, included citations in a Context Source section, and used vector similarity to align responses with guideline content, with the temperature parameter set to 0.5 to enhance consistency. Model responses were evaluated by 3 expert reviewers against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using ANOVA, with statistical significance set at P<.05. Interrater agreement was measured using the Cohen κ coefficient, and readability was estimated using the Flesch-Kincaid readability score.</p><p><strong>Results: </strong>The Llama 3+RAG model scored higher than the Perplexity, GPT-4o, and Llama 3 models on reliability, appropriateness, guideline adherence, and readability and showed no harm. The Cohen κ coefficient (κ>70%; P<.001) indicated high reviewer agreement.</p><p><strong>Conclusions: </strong>The Llama 3+RAG model outperformed the off-the-shelf models across all measures with no evidence of harm, although the responses were less readable due to technical language. The off-the-shelf models scored lower on all measures and produced some harmful responses. 
These findings highlight the limitations of off-the-shelf models and demonstrate that RAG system integration can enhance LLM performance in delivering evidence-based dietary information.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e78625"},"PeriodicalIF":6.0000,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating Large Language Models and Retrieval-Augmented Generation Enhancement for Delivering Guideline-Adherent Nutrition Information for Cardiovascular Disease Prevention: Cross-Sectional Study.\",\"authors\":\"Vijaya Parameswaran, Jenna Bernard, Alec Bernard, Neil Deo, Sean Tsung, Kalle Lyytinen, Christopher Sharp, Fatima Rodriguez, David J Maron, Rajesh Dash\",\"doi\":\"10.2196/78625\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Cardiovascular disease (CVD) remains the leading cause of death worldwide, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness for providing nutrition information remains understudied. Although retrieval-augmented generation (RAG) frameworks have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.</p><p><strong>Objective: </strong>To evaluate the effectiveness of off-the-shelf and RAG-enhanced LLMs in delivering guideline-adherent nutrition information for CVD prevention, we assessed 3 off-the-shelf models (ChatGPT-4o, Perplexity, and Llama 3-70B) and a Llama 3-70B+RAG model.</p><p><strong>Methods: </strong>We curated 30 nutrition questions that comprehensively addressed CVD prevention. These were approved by a registered dietitian providing preventive cardiology services at an academic medical center and were posed 3 times to each model. We developed a 15,074-word knowledge bank incorporating the American Heart Association's 2021 dietary guidelines and related website content to enhance Meta's Llama 3-70B model using RAG. The model received this and a few-shot prompt as context, included citations in a Context Source section, and used vector similarity to align responses with guideline content, with the temperature parameter set to 0.5 to enhance consistency. Model responses were evaluated by 3 expert reviewers against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using ANOVA, with statistical significance set at P<.05. Interrater agreement was measured using the Cohen κ coefficient, and readability was estimated using the Flesch-Kincaid readability score.</p><p><strong>Results: </strong>The Llama 3+RAG model scored higher than the Perplexity, GPT-4o, and Llama 3 models on reliability, appropriateness, guideline adherence, and readability and showed no harm. 
The Cohen κ coefficient (κ>70%; P<.001) indicated high reviewer agreement.</p><p><strong>Conclusions: </strong>The Llama 3+RAG model outperformed the off-the-shelf models across all measures with no evidence of harm, although the responses were less readable due to technical language. The off-the-shelf models scored lower on all measures and produced some harmful responses. These findings highlight the limitations of off-the-shelf models and demonstrate that RAG system integration can enhance LLM performance in delivering evidence-based dietary information.</p>\",\"PeriodicalId\":16337,\"journal\":{\"name\":\"Journal of Medical Internet Research\",\"volume\":\"27 \",\"pages\":\"e78625\"},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2025-10-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Medical Internet Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/78625\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/78625","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Evaluating Large Language Models and Retrieval-Augmented Generation Enhancement for Delivering Guideline-Adherent Nutrition Information for Cardiovascular Disease Prevention: Cross-Sectional Study.
Background: Cardiovascular disease (CVD) remains the leading cause of death worldwide, yet many web-based sources on cardiovascular (CV) health are inaccessible. Large language models (LLMs) are increasingly used for health-related inquiries and offer an opportunity to produce accessible and scalable CV health information. However, because these models are trained on heterogeneous data, including unverified user-generated content, the quality and reliability of food and nutrition information on CVD prevention remain uncertain. Recent studies have examined LLM use in various health care applications, but their effectiveness for providing nutrition information remains understudied. Although retrieval-augmented generation (RAG) frameworks have been shown to enhance LLM consistency and accuracy, their use in delivering nutrition information for CVD prevention requires further evaluation.
Objective: To evaluate the effectiveness of off-the-shelf and RAG-enhanced LLMs in delivering guideline-adherent nutrition information for CVD prevention, we assessed 3 off-the-shelf models (ChatGPT-4o, Perplexity, and Llama 3-70B) and a Llama 3-70B+RAG model.
Methods: We curated 30 nutrition questions that comprehensively addressed CVD prevention. These were approved by a registered dietitian providing preventive cardiology services at an academic medical center and were posed 3 times to each model. We developed a 15,074-word knowledge bank incorporating the American Heart Association's 2021 dietary guidelines and related website content to enhance Meta's Llama 3-70B model using RAG. The model received this knowledge bank and a few-shot prompt as context, included citations in a Context Source section, and used vector similarity to align responses with guideline content, with the temperature parameter set to 0.5 to enhance consistency. Model responses were evaluated by 3 expert reviewers against benchmark CV guidelines for appropriateness, reliability, readability, harm, and guideline adherence. Mean scores were compared using ANOVA, with statistical significance set at P<.05. Interrater agreement was measured using the Cohen κ coefficient, and readability was estimated using the Flesch-Kincaid readability score.
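The abstract does not report implementation details beyond those above, but the retrieval step it describes (embedding chunks of the guideline knowledge bank, selecting the most similar chunks by vector similarity, and composing them with a few-shot prompt as context) can be sketched roughly as follows. This is a minimal illustration only: the embedding model, chunk size, prompt wording, and function names are assumptions, not the authors' configuration.

```python
# Minimal sketch of a guideline-grounded RAG step, under the assumptions noted above.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of encoder


def chunk(text: str, size: int = 200) -> list[str]:
    """Split the knowledge bank into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question by cosine similarity."""
    q = encoder.encode([question], normalize_embeddings=True)
    c = encoder.encode(chunks, normalize_embeddings=True)
    scores = (c @ q.T).ravel()  # cosine similarity, since embeddings are normalized
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]


def build_prompt(question: str, context_chunks: list[str], few_shot: str) -> str:
    """Compose the few-shot examples and retrieved guideline text as context."""
    context = "\n\n".join(context_chunks)
    return (
        f"{few_shot}\n\n"
        f"Context Source:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer using only the context above and cite the Context Source."
    )

# The composed prompt would then be sent to Llama 3-70B with temperature=0.5,
# as reported, through whatever inference endpoint is in use.
```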
Results: The Llama 3+RAG model scored higher than the Perplexity, GPT-4o, and Llama 3 models on reliability, appropriateness, guideline adherence, and readability and showed no harm. The Cohen κ coefficient (κ>70%; P<.001) indicated high reviewer agreement.
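As a point of reference for the statistics reported above, the sketch below shows how analyses of this kind are commonly computed in Python: a one-way ANOVA across model score distributions, Cohen κ for interrater agreement, and a Flesch-Kincaid readability estimate. The score values are placeholders for illustration, not the study's data, and the library choices are assumptions rather than the authors' tooling.

```python
# Illustrative evaluation statistics with placeholder data (not study results).
from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score
import textstat

# Hypothetical per-response reliability scores for three models.
gpt4o = [3.2, 3.5, 3.1, 3.4, 3.3]
llama3 = [3.0, 3.3, 3.2, 3.1, 3.0]
llama3_rag = [4.6, 4.7, 4.5, 4.8, 4.6]

f_stat, p_value = f_oneway(gpt4o, llama3, llama3_rag)
print(f"ANOVA: F={f_stat:.2f}, P={p_value:.4f}")  # significant if P < .05

# Agreement between two reviewers' categorical ratings of the same responses.
reviewer_a = ["appropriate", "appropriate", "inappropriate", "appropriate", "appropriate"]
reviewer_b = ["appropriate", "appropriate", "inappropriate", "inappropriate", "appropriate"]
print("Cohen kappa:", cohen_kappa_score(reviewer_a, reviewer_b))

# Readability of a sample model response (grade-level variant shown here;
# the study may instead use the Flesch reading-ease score).
response = "Eat more vegetables, whole grains, and legumes to help lower LDL cholesterol."
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(response))
```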
Conclusions: The Llama 3+RAG model outperformed the off-the-shelf models across all measures with no evidence of harm, although the responses were less readable due to technical language. The off-the-shelf models scored lower on all measures and produced some harmful responses. These findings highlight the limitations of off-the-shelf models and demonstrate that RAG system integration can enhance LLM performance in delivering evidence-based dietary information.
About the journal:
The Journal of Medical Internet Research (JMIR) is a highly respected publication in health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for more than two decades.
The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research, and is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR is ranked #1 on Google Scholar in the "Medical Informatics" discipline.