Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions.

IF 2.1 | Zone 4, Medicine | JCR Q4, ENDOCRINOLOGY & METABOLISM
Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa, Tetsuya Imamura, Zenbei Matsumoto, Hidetaka Nambo, Shigehiro Karashima
Endocrine Journal. Journal Article. Published 2025-09-11. DOI: 10.1507/endocrj.EJ25-0201 (https://doi.org/10.1507/endocrj.EJ25-0201)

Abstract

GPT-4o, a general-purpose large language model, has a retrieval-augmented variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models (GPT-4o, GPT-4o-mini, and GPT-4o-RAG) were assessed using 599 publicly available multiple-choice questions from the 2022-2024 national examinations. For each model, we generated answers to each question five times and based our evaluation on these repeated outputs to assess response variability and robustness. A custom pipeline was implemented for GPT-4o-RAG to retrieve guideline-based documents for integration with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term Frequency-Inverse Document Frequency (TF-IDF) analysis was conducted to compare word characteristics in correctly and incorrectly answered questions. All three models achieved accuracy rates above the 60% passing threshold. GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%) and GPT-4o-mini (70.0% ± 1.4%). While the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), GPT-4o-RAG exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed the other models in the applied and clinical nutrition categories but showed limited performance on numerical questions. TF-IDF analysis suggested that incorrect answers were more frequently associated with numerical terms. GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.
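
The repeated-run evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that accuracy is reported as the mean and sample standard deviation of per-run accuracy, and that "response consistency" means the share of questions for which all runs chose the same option; the abstract does not define either, and the function names and toy data below are hypothetical.

```python
from statistics import mean, stdev


def run_accuracies(runs, key):
    """Return one accuracy value per run (fraction of questions answered correctly)."""
    return [sum(a == k for a, k in zip(run, key)) / len(key) for run in runs]


def consistency(runs):
    """Fraction of questions where every run chose the same option.

    This definition is an assumption; the abstract does not state one.
    """
    per_question = list(zip(*runs))  # regroup answers by question
    same = sum(len(set(choices)) == 1 for choices in per_question)
    return same / len(per_question)


# Toy data (hypothetical): 5 runs over 4 multiple-choice questions.
key = ["a", "c", "b", "d"]
runs = [
    ["a", "c", "b", "a"],
    ["a", "c", "b", "d"],
    ["a", "c", "e", "a"],
    ["a", "c", "b", "a"],
    ["a", "c", "b", "a"],
]

accs = run_accuracies(runs, key)
print(f"accuracy: {mean(accs):.1%} ± {stdev(accs):.1%}")
print(f"consistency: {consistency(runs):.1%}")
```

Separating the two measures shows why a model can gain little in mean accuracy yet still be markedly more consistent: the first averages over runs, while the second looks at agreement across runs question by question.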
