Evaluating AI-Generated Meal Plans for Simulated Diabetes Profiles: A Guideline-Based Comparison of Three Language Models
Hatice Merve Bayram, Sedat Arslan, Arda Ozturkcan
Journal of Evaluation in Clinical Practice, 31(7), e70295 (2025-10-01). DOI: 10.1111/jep.70295
Citations: 0
Abstract
Aims: This simulation study, which used synthetic profiles and no real patient data, aimed to evaluate and compare the performance of three prominent large language models (LLMs), ChatGPT-4.1, Grok-3 and DeepSeek, in generating medical nutrition therapy-aligned dietary plans for adults with type 2 diabetes mellitus (T2DM).
Methods: A simulation-based design was employed using 24 standardized virtual patient profiles differentiated by gender and body mass index (BMI) category. Each LLM was prompted in Turkish to generate 3-day meal plans. Outputs were assessed for energy and macro-/micronutrient accuracy, adherence to national and international T2DM guidelines and alignment with the nutrition care process (NCP).
Results: ChatGPT-4.1 showed the highest alignment with energy requirements (70.9%) but overestimated fat intake. Grok-3 demonstrated superior energy accuracy (83.1%) but failed to meet several micronutrient targets. DeepSeek adjusted protein intake according to BMI but underdelivered carbohydrates. None of the models demonstrated full concordance with the NCP framework, particularly in the diagnosis and monitoring components. Frequent hallucinations and lack of clinical contextualization were noted. Integration of retrieval-augmented generation (RAG) was identified as a potential improvement strategy.
Conclusion: While the LLMs showed promise in generating baseline dietary guidance in a simulated context, these results reflected concordance with guideline documents only and should not be interpreted as evidence of equivalence to dietitian-led care. These findings reflected model behaviour in synthetic scenarios only and highlighted the need for RAG integration and expert supervision before any clinical application.
About the journal
The Journal of Evaluation in Clinical Practice aims to promote the evaluation and development of clinical practice across medicine, nursing and the allied health professions. All aspects of health services research and public health policy analysis and debate are of interest to the Journal whether studied from a population-based or individual patient-centred perspective. Of particular interest to the Journal are submissions on all aspects of clinical effectiveness and efficiency including evidence-based medicine, clinical practice guidelines, clinical decision making, clinical services organisation, implementation and delivery, health economic evaluation, health process and outcome measurement and new or improved methods (conceptual and statistical) for systematic inquiry into clinical practice. Papers may take a classical quantitative or qualitative approach to investigation (or may utilise both techniques) or may take the form of learned essays, structured/systematic reviews and critiques.