{"title":"gpt - 40结合检索增强生成在营养师执照考试试题中的表现。","authors":"Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa, Tetsuya Imamura, Zenbei Matsumoto, Hidetaka Nambo, Shigehiro Karashima","doi":"10.1507/endocrj.EJ25-0201","DOIUrl":null,"url":null,"abstract":"<p><p>GPT-4o, a general-purpose large language model, has a Retrieval-Augmented Variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models-GPT-4o, GPT-4o-mini, and GPT-4o-RAG-were assessed using 599 publicly available multiple-choice questions from the 2022-2024 national examinations. For each model, we generated answers to each question five times and based our evaluation on these multiple outputs to assess response variability and robustness. A custom pipeline was implemented for GPT-4o-RAG to retrieve guideline-based documents for integration with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term Frequency-Inverse Document Frequency analysis was conducted to compare word characteristics in correctly and incorrectly answered questions. All three models achieved accuracy rates >60%, the passing threshold. GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%), and GPT-4o-mini (70.0% ± 1.4%). While the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), it exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed other models in applied and clinical nutrition categories but showed limited performance on numerical questions. Term Frequency-Inverse Document Frequency analysis suggested that incorrect answers were more frequently associated with numerical terms. 
GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.</p>","PeriodicalId":11631,"journal":{"name":"Endocrine journal","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions.\",\"authors\":\"Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa, Tetsuya Imamura, Zenbei Matsumoto, Hidetaka Nambo, Shigehiro Karashima\",\"doi\":\"10.1507/endocrj.EJ25-0201\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>GPT-4o, a general-purpose large language model, has a Retrieval-Augmented Variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models-GPT-4o, GPT-4o-mini, and GPT-4o-RAG-were assessed using 599 publicly available multiple-choice questions from the 2022-2024 national examinations. For each model, we generated answers to each question five times and based our evaluation on these multiple outputs to assess response variability and robustness. A custom pipeline was implemented for GPT-4o-RAG to retrieve guideline-based documents for integration with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term Frequency-Inverse Document Frequency analysis was conducted to compare word characteristics in correctly and incorrectly answered questions. All three models achieved accuracy rates >60%, the passing threshold. 
GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%), and GPT-4o-mini (70.0% ± 1.4%). While the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), it exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed other models in applied and clinical nutrition categories but showed limited performance on numerical questions. Term Frequency-Inverse Document Frequency analysis suggested that incorrect answers were more frequently associated with numerical terms. GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.</p>\",\"PeriodicalId\":11631,\"journal\":{\"name\":\"Endocrine journal\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Endocrine journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1507/endocrj.EJ25-0201\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"ENDOCRINOLOGY & METABOLISM\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Endocrine journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1507/endocrj.EJ25-0201","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENDOCRINOLOGY & METABOLISM","Score":null,"Total":0}
Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions.
GPT-4o, a general-purpose large language model, has a retrieval-augmented variant (GPT-4o-RAG) that can assist in dietary counseling, but research on its application in this field remains scarce. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized evaluation benchmark. Three language models (GPT-4o, GPT-4o-mini, and GPT-4o-RAG) were assessed on 599 publicly available multiple-choice questions from the 2022-2024 national examinations. Each model answered every question five times, and the evaluation was based on these repeated outputs to assess response variability and robustness. For GPT-4o-RAG, a custom pipeline retrieved guideline-based documents and integrated them into the GPT-generated responses. Accuracy, variance, and response consistency were evaluated, and term frequency-inverse document frequency (TF-IDF) analysis was conducted to compare word characteristics between correctly and incorrectly answered questions. All three models exceeded the 60% passing threshold. GPT-4o-RAG achieved the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%) and GPT-4o-mini (70.0% ± 1.4%). Although GPT-4o-RAG's accuracy gain over GPT-4o was not statistically significant (p = 0.12), it showed significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed the other models in the applied and clinical nutrition categories but performed less well on numerical questions, and TF-IDF analysis suggested that incorrect answers were more frequently associated with numerical terms. Overall, GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition; however, its limitations in numerical reasoning and individualized guidance warrant further development and validation.
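The retrieval step of the custom pipeline described above can be sketched as follows. This is a minimal illustration only: the paper does not specify its retriever, so the corpus snippets, the token-overlap scoring, and the prompt layout here are all assumptions, not the authors' actual implementation.

```python
from collections import Counter

# Toy guideline corpus; in the study this would be dietary guideline documents.
corpus = [
    "dietary guidelines recommend limiting sodium intake to under 7.5 g per day for men",
    "insulin lowers blood glucose by promoting uptake into muscle and adipose tissue",
    "vitamin d supports calcium absorption in the intestine",
]

def score(query, doc):
    """Crude lexical relevance: count of shared tokens
    (a hypothetical stand-in for the paper's unspecified retriever)."""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum((q & d).values())

def build_prompt(question, corpus, k=2):
    """Prepend the k most relevant snippets to the exam question."""
    top = sorted(corpus, key=lambda d: score(question, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return f"Reference material:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("how much sodium per day is recommended", corpus)
```

The augmented prompt would then be sent to the language model; integrating retrieved guideline text this way is what distinguishes GPT-4o-RAG from plain GPT-4o in the study.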
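The evaluation protocol (five sampled answers per question, accuracy reported as mean ± SD across runs, and consistency as the share of questions answered identically in all five runs) can be sketched as below. The question IDs, answers, and helper names are illustrative; they are not the paper's data or code.

```python
import math

# Hypothetical records: for each question, five sampled model answers
# and the answer key. Contents are made up for illustration.
runs = {
    "q1": (["2", "2", "2", "2", "2"], "2"),
    "q2": (["3", "1", "3", "3", "3"], "3"),
    "q3": (["4", "4", "4", "4", "4"], "1"),
}

def accuracy_per_run(runs, n_runs=5):
    """Accuracy of each sampling run across all questions."""
    return [
        sum(answers[i] == key for answers, key in runs.values()) / len(runs)
        for i in range(n_runs)
    ]

def response_consistency(runs):
    """Fraction of questions whose five answers are all identical."""
    return sum(len(set(answers)) == 1 for answers, _ in runs.values()) / len(runs)

def mean_sd(xs):
    """Mean and population standard deviation."""
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

accuracies = accuracy_per_run(runs)
mean_acc, sd_acc = mean_sd(accuracies)    # reported as mean ± SD in the paper
consistency = response_consistency(runs)  # 2/3 for this toy data
```

Note that a question can be perfectly consistent yet wrong in every run (q3 above), which is why the paper reports consistency and accuracy separately.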
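The TF-IDF comparison of correctly versus incorrectly answered questions can be illustrated with a plain, dependency-free sketch. The question texts, the smoothed IDF form, and the `numeric_share` summary are assumptions for illustration; the paper does not publish its exact TF-IDF procedure.

```python
import math
import re
from collections import Counter

# Illustrative question texts grouped by outcome (contents are made up).
correct_qs = [
    "which vitamin is fat soluble",
    "role of insulin in glucose metabolism",
]
incorrect_qs = [
    "calculate the energy requirement at 30 kcal per kg for a 62 kg patient",
    "what percentage of adults keep sodium intake below 6 g per day",
]

def tfidf_weights(docs):
    """Summed TF-IDF weight per token over whitespace-tokenized texts,
    using a smoothed IDF: log(1 + N / df)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    weights = Counter()
    for d in docs:
        tf = Counter(d.split())
        total = sum(tf.values())
        for w, c in tf.items():
            weights[w] += (c / total) * math.log(1 + n / df[w])
    return weights

def numeric_share(weights):
    """Share of total TF-IDF mass carried by purely numeric tokens."""
    total = sum(weights.values())
    num = sum(v for w, v in weights.items() if re.fullmatch(r"\d+(\.\d+)?", w))
    return num / total if total else 0.0
```

On this toy data, numeric tokens carry weight only in the incorrectly answered group, mirroring the paper's finding that wrong answers clustered around numerical terms.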
Journal introduction:
Endocrine Journal is a long-established, open-access, peer-reviewed online journal. It publishes peer-reviewed research articles across the basic, translational, and clinical fields of endocrinology, and provides a forum for exchanging ideas, concepts, and scientific observations in any area of contemporary endocrinology. Manuscripts may be submitted as Original Articles, Notes, Rapid Communications, or Review Articles. The journal maintains a rapid reviewing and editorial decision process and pays special attention to quick, rigorous, and frequently citable publication. Please refer to the link for the author guidelines.