Evaluating AI-Generated Meal Plans for Simulated Diabetes Profiles: A Guideline-Based Comparison of Three Language Models.

Impact factor 2.1; Zone 4 (Medicine); JCR Q3, HEALTH CARE SCIENCES & SERVICES
Hatice Merve Bayram, Sedat Arslan, Arda Ozturkcan
Journal of Evaluation in Clinical Practice, 31(7), e70295. Journal Article. Published 2025-10-01. DOI: 10.1111/jep.70295 (https://doi.org/10.1111/jep.70295)
Citation count: 0

Abstract

Aims: This synthetic simulation study, which used no real patient data, aimed to evaluate and compare the performance of three prominent large language models (LLMs), ChatGPT-4.1, Grok-3 and DeepSeek, in generating dietary plans aligned with medical nutrition therapy for adults with type 2 diabetes mellitus (T2DM).

Methods: A simulation-based design was employed using 24 standardized virtual patient profiles differentiated by gender and body mass index (BMI) category. Each LLM was prompted in Turkish to generate 3-day meal plans. Outputs were assessed for energy and macro-/micronutrient accuracy, adherence to national and international T2DM guidelines and alignment with the nutrition care process (NCP).

Results: ChatGPT-4.1 showed the highest alignment with energy requirements (70.9%) but overestimated fat intake. Grok-3 demonstrated superior energy accuracy (83.1%) but failed to meet several micronutrient targets. DeepSeek adjusted protein intake according to BMI but underdelivered carbohydrates. None of the models demonstrated full concordance with the NCP framework, particularly in the diagnosis and monitoring components. Frequent hallucinations and lack of clinical contextualization were noted. Integration of retrieval-augmented generation (RAG) was identified as a potential improvement strategy.

Conclusion: While LLMs showed promise in generating baseline dietary guidance in a simulated context, these results reflected concordance with guideline documents only and should not be interpreted as evidence of equivalence to dietitian-led care. These findings reflected model behaviour in synthetic scenarios only and highlighted the need for RAG integration and expert supervision before any clinical application.

Source journal
CiteScore: 4.80
Self-citation rate: 4.20%
Articles published: 143
Review time: 3-8 weeks
About the journal: The Journal of Evaluation in Clinical Practice aims to promote the evaluation and development of clinical practice across medicine, nursing and the allied health professions. All aspects of health services research and public health policy analysis and debate are of interest to the Journal whether studied from a population-based or individual patient-centred perspective. Of particular interest to the Journal are submissions on all aspects of clinical effectiveness and efficiency including evidence-based medicine, clinical practice guidelines, clinical decision making, clinical services organisation, implementation and delivery, health economic evaluation, health process and outcome measurement and new or improved methods (conceptual and statistical) for systematic inquiry into clinical practice. Papers may take a classical quantitative or qualitative approach to investigation (or may utilise both techniques) or may take the form of learned essays, structured/systematic reviews and critiques.