Performance Evaluation of 3 Large Language Models for Nutritional Content Estimation from Food Images
Jonatan Fridolfsson, Emma Sjöberg, Meri Thiwång, Stefan Pettersson
Current Developments in Nutrition, 9(10), Article 107556, October 2025. https://doi.org/10.1016/j.cdnut.2025.107556
Background
Traditional dietary assessment methods face limitations, including recall bias, participant burden, and portion-size estimation errors. Recent advances in artificial intelligence, particularly multimodal large language models (LLMs), offer potential solutions for automated nutritional analysis from food images.
Objectives
This study aims to evaluate and compare the performance of 3 leading LLMs (ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) in estimating food weight, energy content, and macronutrient composition from standardized food photographs.
Methods
We analyzed 52 standardized food photographs, including individual food components (n = 16) and complete meals (n = 36) in 3 portion sizes (small, medium, large). Each model received identical prompts to identify food components and estimate nutritional content, using visible cutlery and plates as size references. Model estimates were compared against reference values obtained through direct weighing and nutritional database analysis (Dietist NET). Performance metrics included mean absolute percentage error (MAPE), Pearson correlations, and systematic bias analysis using Bland–Altman plots.
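For illustration, a multimodal image-plus-prompt query of the kind described above might be issued through the OpenAI Python SDK as sketched below. The prompt wording, image filename, and the gpt-4o model handle are assumptions; the abstract does not report the study's exact prompt or tooling.

```python
import base64
from openai import OpenAI

client = OpenAI()  # API key read from the OPENAI_API_KEY environment variable

# Hypothetical prompt; the study's exact wording is not given in the abstract.
PROMPT = (
    "Identify each food item in this photograph and estimate its weight (g), "
    "energy (kcal), and macronutrient content, using the visible cutlery and "
    "plate as size references."
)

# Placeholder filename standing in for one standardized food photograph.
with open("meal_photo.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumed API handle for the model the paper calls ChatGPT-4o
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```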
Results
ChatGPT and Claude demonstrated similar accuracy, with MAPE values of 36.3% and 37.3% for weight estimation and 35.8% for energy estimation. Gemini showed substantially higher errors across all nutrients (MAPE 64.2%–109.9%). Correlations between model estimates and reference values ranged from 0.65 to 0.81 for ChatGPT and Claude, compared with 0.58–0.73 for Gemini. All models exhibited systematic underestimation that increased with portion size, with bias slopes ranging from –0.23 to –0.50.
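As a minimal sketch of the reported metrics, the snippet below computes MAPE, the Pearson correlation, and the Bland–Altman bias slope (the regression of estimate-minus-reference differences on pairwise means, where a negative slope means underestimation grows with the measured quantity). The paired values are toy data, not the study's measurements.

```python
import numpy as np
from scipy import stats

# Toy paired values (grams): weighed reference vs. a model's estimate.
# These numbers are illustrative only, not the study's measurements.
reference = np.array([120.0, 250.0, 380.0, 95.0, 310.0])
estimate = np.array([105.0, 210.0, 290.0, 90.0, 245.0])

# Mean absolute percentage error (MAPE), as reported per nutrient.
mape = np.mean(np.abs(estimate - reference) / reference) * 100

# Pearson correlation between model estimates and reference values.
r, _p = stats.pearsonr(estimate, reference)

# Bland–Altman bias slope: regress the difference (estimate - reference)
# on the pairwise mean. A negative slope indicates underestimation that
# grows with portion size, matching the reported -0.23 to -0.50 range.
means = (estimate + reference) / 2
diffs = estimate - reference
fit = stats.linregress(means, diffs)

print(f"MAPE: {mape:.1f}%  Pearson r: {r:.2f}  bias slope: {fit.slope:.2f}")
```

Read this way, a bias slope of –0.50 implies roughly 50 g of additional underestimation for every 100 g increase in portion size.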
Conclusions
ChatGPT and Claude achieved accuracy levels comparable with traditional self-reported dietary assessment methods but without the associated user burden, suggesting potential utility as dietary monitoring tools. However, systematic underestimation of large portions and high variability in macronutrient estimation indicate that these general-purpose LLMs are not yet suitable for precise dietary assessment in clinical or athletic populations, where accurate quantification is critical.