{"title":"gpt - 40结合检索增强生成在营养师执照考试试题中的表现。","authors":"Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa, Tetsuya Imamura, Zenbei Matsumoto, Hidetaka Nambo, Shigehiro Karashima","doi":"10.1507/endocrj.EJ25-0201","DOIUrl":null,"url":null,"abstract":"<p><p>GPT-4o, a general-purpose large language model, has a Retrieval-Augmented Variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models-GPT-4o, GPT-4o-mini, and GPT-4o-RAG-were assessed using 599 publicly available multiple-choice questions from the 2022-2024 national examinations. For each model, we generated answers to each question five times and based our evaluation on these multiple outputs to assess response variability and robustness. A custom pipeline was implemented for GPT-4o-RAG to retrieve guideline-based documents for integration with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term Frequency-Inverse Document Frequency analysis was conducted to compare word characteristics in correctly and incorrectly answered questions. All three models achieved accuracy rates >60%, the passing threshold. GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%), and GPT-4o-mini (70.0% ± 1.4%). While the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), it exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed other models in applied and clinical nutrition categories but showed limited performance on numerical questions. Term Frequency-Inverse Document Frequency analysis suggested that incorrect answers were more frequently associated with numerical terms. 
GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.</p>","PeriodicalId":11631,"journal":{"name":"Endocrine journal","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions.\",\"authors\":\"Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa, Tetsuya Imamura, Zenbei Matsumoto, Hidetaka Nambo, Shigehiro Karashima\",\"doi\":\"10.1507/endocrj.EJ25-0201\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>GPT-4o, a general-purpose large language model, has a Retrieval-Augmented Variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models-GPT-4o, GPT-4o-mini, and GPT-4o-RAG-were assessed using 599 publicly available multiple-choice questions from the 2022-2024 national examinations. For each model, we generated answers to each question five times and based our evaluation on these multiple outputs to assess response variability and robustness. A custom pipeline was implemented for GPT-4o-RAG to retrieve guideline-based documents for integration with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term Frequency-Inverse Document Frequency analysis was conducted to compare word characteristics in correctly and incorrectly answered questions. All three models achieved accuracy rates >60%, the passing threshold. 
GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%), and GPT-4o-mini (70.0% ± 1.4%). While the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), it exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed other models in applied and clinical nutrition categories but showed limited performance on numerical questions. Term Frequency-Inverse Document Frequency analysis suggested that incorrect answers were more frequently associated with numerical terms. GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.</p>\",\"PeriodicalId\":11631,\"journal\":{\"name\":\"Endocrine journal\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Endocrine journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1507/endocrj.EJ25-0201\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"ENDOCRINOLOGY & METABOLISM\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Endocrine journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1507/endocrj.EJ25-0201","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENDOCRINOLOGY & METABOLISM","Score":null,"Total":0}
Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions.
GPT-4o, a general-purpose large language model, has a retrieval-augmented variant (GPT-4o-RAG) that can assist in dietary counseling, but research on its application in this field remains scarce. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized evaluation benchmark. Three language models (GPT-4o, GPT-4o-mini, and GPT-4o-RAG) were assessed on 599 publicly available multiple-choice questions from the 2022-2024 national examinations. Each model answered every question five times, and the evaluation was based on these repeated outputs to assess response variability and robustness. For GPT-4o-RAG, a custom pipeline retrieved guideline-based documents and integrated them into the GPT-generated responses. Accuracy, variance, and response consistency were evaluated, and term frequency-inverse document frequency (TF-IDF) analysis was conducted to compare word characteristics between correctly and incorrectly answered questions. All three models exceeded the 60% passing threshold. GPT-4o-RAG achieved the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%) and GPT-4o-mini (70.0% ± 1.4%). Although GPT-4o-RAG's accuracy gain over GPT-4o was not statistically significant (p = 0.12), it showed significantly lower variance and higher response consistency (97.3% vs. 91.2-95.2%, p < 0.001). GPT-4o-RAG outperformed the other models in the applied and clinical nutrition categories but performed less well on numerical questions, and TF-IDF analysis suggested that incorrect answers were more frequently associated with numerical terms. Overall, GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition; however, its limitations in numerical reasoning and individualized guidance warrant further development and validation.
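The retrieval step of the custom pipeline described above can be sketched as follows. This is a minimal illustration only: the paper does not specify its retriever, so the corpus snippets, the token-overlap scoring, and the prompt layout here are all assumptions, not the authors' actual implementation.

```python
from collections import Counter

# Toy guideline corpus; in the study this would be dietary guideline documents.
corpus = [
    "dietary guidelines recommend limiting sodium intake to under 7.5 g per day for men",
    "insulin lowers blood glucose by promoting uptake into muscle and adipose tissue",
    "vitamin d supports calcium absorption in the intestine",
]

def score(query, doc):
    """Crude lexical relevance: count of shared tokens
    (a hypothetical stand-in for the paper's unspecified retriever)."""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum((q & d).values())

def build_prompt(question, corpus, k=2):
    """Prepend the k most relevant snippets to the exam question."""
    top = sorted(corpus, key=lambda d: score(question, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return f"Reference material:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("how much sodium per day is recommended", corpus)
```

The augmented prompt would then be sent to the language model; integrating retrieved guideline text this way is what distinguishes GPT-4o-RAG from plain GPT-4o in the study.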
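The evaluation protocol (five sampled answers per question, accuracy reported as mean ± SD across runs, and consistency as the share of questions answered identically in all five runs) can be sketched as below. The question IDs, answers, and helper names are illustrative; they are not the paper's data or code.

```python
import math

# Hypothetical records: for each question, five sampled model answers
# and the answer key. Contents are made up for illustration.
runs = {
    "q1": (["2", "2", "2", "2", "2"], "2"),
    "q2": (["3", "1", "3", "3", "3"], "3"),
    "q3": (["4", "4", "4", "4", "4"], "1"),
}

def accuracy_per_run(runs, n_runs=5):
    """Accuracy of each sampling run across all questions."""
    return [
        sum(answers[i] == key for answers, key in runs.values()) / len(runs)
        for i in range(n_runs)
    ]

def response_consistency(runs):
    """Fraction of questions whose five answers are all identical."""
    return sum(len(set(answers)) == 1 for answers, _ in runs.values()) / len(runs)

def mean_sd(xs):
    """Mean and population standard deviation."""
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

accuracies = accuracy_per_run(runs)
mean_acc, sd_acc = mean_sd(accuracies)    # reported as mean ± SD in the paper
consistency = response_consistency(runs)  # 2/3 for this toy data
```

Note that a question can be perfectly consistent yet wrong in every run (q3 above), which is why the paper reports consistency and accuracy separately.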
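The TF-IDF comparison of correctly versus incorrectly answered questions can be illustrated with a plain, dependency-free sketch. The question texts, the smoothed IDF form, and the `numeric_share` summary are assumptions for illustration; the paper does not publish its exact TF-IDF procedure.

```python
import math
import re
from collections import Counter

# Illustrative question texts grouped by outcome (contents are made up).
correct_qs = [
    "which vitamin is fat soluble",
    "role of insulin in glucose metabolism",
]
incorrect_qs = [
    "calculate the energy requirement at 30 kcal per kg for a 62 kg patient",
    "what percentage of adults keep sodium intake below 6 g per day",
]

def tfidf_weights(docs):
    """Summed TF-IDF weight per token over whitespace-tokenized texts,
    using a smoothed IDF: log(1 + N / df)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    weights = Counter()
    for d in docs:
        tf = Counter(d.split())
        total = sum(tf.values())
        for w, c in tf.items():
            weights[w] += (c / total) * math.log(1 + n / df[w])
    return weights

def numeric_share(weights):
    """Share of total TF-IDF mass carried by purely numeric tokens."""
    total = sum(weights.values())
    num = sum(v for w, v in weights.items() if re.fullmatch(r"\d+(\.\d+)?", w))
    return num / total if total else 0.0
```

On this toy data, numeric tokens carry weight only in the incorrectly answered group, mirroring the paper's finding that wrong answers clustered around numerical terms.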
Journal introduction:
Endocrine Journal is a long-established, open-access, peer-reviewed online journal. It publishes peer-reviewed research articles across the basic, translational, and clinical fields of endocrinology, and provides a forum for exchanging ideas, concepts, and scientific observations in any area of contemporary endocrinology. Manuscripts may be submitted as Original Articles, Notes, Rapid Communications, or Review Articles. The journal maintains a rapid reviewing and editorial decision process and pays special attention to quick, rigorous, and frequently citable publication. Please refer to the link for the author guidelines.