Retrieval-augmented generation enhances large language model performance on the Japanese orthopedic board examination.

Juntaro Maruyama, Satoshi Maki, Takeo Furuya, Yuki Nagashima, Kyota Kitagawa, Yasunori Toki, Shuhei Iwata, Megumi Yazaki, Takaki Kitamura, Sho Gushiken, Yuji Noguchi, Masataka Miura, Masahiro Inoue, Yasuhiro Shiga, Kazuhide Inage, Sumihisa Orita, Seiji Ohtori
Journal of Orthopaedic Science. Published 2025-03-28. DOI: 10.1016/j.jos.2025.03.003 (https://doi.org/10.1016/j.jos.2025.03.003)

Abstract

Introduction: Large language models (LLMs) have shown potential in medical applications. However, their effectiveness in specialized medical domains remains underexplored. The integration of Retrieval-Augmented Generation (RAG) has been proposed to improve these models by reducing hallucinations and enhancing domain-specific information access. Through this evaluation, we aim to assess whether RAG can effectively bridge the gap between LLMs' current capabilities and the accuracy needed for medical use by examining GPT-3.5 Turbo, GPT-4o, and o1-preview on the 2024 Japanese Orthopedic Specialist Examination.

Methods: A specialized database was created using the "Standard Textbook of Orthopedics", and GPT-3.5 Turbo, GPT-4o, and o1-preview were evaluated with and without RAG. Models were tested on text-based and image-based questions exactly as presented in Japanese. An error analysis was conducted to identify key performance factors.
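
The with-and-without-RAG comparison described in the Methods can be pictured with a minimal sketch. The abstract does not state how passages were indexed or retrieved, so everything below is an assumption made for illustration: the bag-of-words cosine scoring, the function names `retrieve` and `build_prompt`, the prompt wording, and the placeholder passages (which stand in for textbook chunks and carry no medical content). The ASCII tokenizer is also a simplification; the actual exam and textbook are in Japanese and would need a Japanese-aware tokenizer or an embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified ASCII tokenizer (assumption); Japanese text needs a real tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=1):
    # Rank passages by similarity to the question and keep the top k.
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, choices, context=None):
    # Without context this is the baseline prompt; with context it is the RAG prompt.
    lines = []
    if context:
        lines.append("Reference passages:")
        lines.extend(f"- {p}" for p in context)
        lines.append("")
    lines.append(f"Question: {question}")
    lines.extend(f"{label}. {text}" for label, text in choices)
    lines.append("Answer with the letter of the best choice.")
    return "\n".join(lines)


# Hypothetical placeholder data, not taken from the textbook or the exam.
passages = [
    "placeholder passage about distal radius fracture classification",
    "placeholder passage about lumbar spinal stenosis symptoms",
    "placeholder passage about knee osteoarthritis imaging",
]
question = "Which classification applies to a distal radius fracture?"
choices = [("a", "option one"), ("b", "option two")]

context = retrieve(question, passages, k=1)
baseline_prompt = build_prompt(question, choices)
rag_prompt = build_prompt(question, choices, context)
print(rag_prompt)
```

Sending `baseline_prompt` and `rag_prompt` to the same model and scoring both answers against the key would give the paired comparison the study reports. The model call itself is left out here because the abstract does not specify it.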

Results: GPT-3.5 Turbo showed no improvement with RAG: its overall accuracy was 28 % with RAG, compared with a baseline of 29 % without RAG. GPT-4o rose from 62 % to 72 %, while o1-preview increased from 67 % to 84 %. Error analysis indicated that GPT-3.5 Turbo primarily failed to apply retrieved data, whereas GPT-4o and o1-preview made errors when the database lacked relevant information or when dealing with image-based questions.
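
The size of each change is easier to compare as a percentage-point difference. The short sketch below only restates the accuracies reported in the Results; the dictionary layout and the function name `point_gains` are illustrative choices. The abstract does not give the number of questions, so no confidence intervals or significance tests are attempted.

```python
# Reported overall accuracy in percent: (without RAG, with RAG).
accuracy = {
    "GPT-3.5 Turbo": (29, 28),
    "GPT-4o": (62, 72),
    "o1-preview": (67, 84),
}


def point_gains(results):
    # Percentage-point change from baseline to RAG for each model.
    return {model: with_rag - base for model, (base, with_rag) in results.items()}


gains = point_gains(accuracy)
for model, gain in gains.items():
    print(f"{model}: {gain:+d} percentage points")
# GPT-3.5 Turbo: -1 percentage points
# GPT-4o: +10 percentage points
# o1-preview: +17 percentage points
```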

Conclusions: The integration of RAG significantly boosted performance for GPT-4o and especially o1-preview. While both models surpassed the passing threshold, o1-preview demonstrated a level of proficiency relevant to clinical practice. However, RAG did not improve the performance of GPT-3.5 Turbo because the model lacks effective reasoning abilities.

Source journal
Journal of Orthopaedic Science (Medicine: Orthopedics)
CiteScore: 3.00
Self-citation rate: 0.00%
Articles published: 290
Review time: 90 days
About the journal: The Journal of Orthopaedic Science is the official peer-reviewed journal of the Japanese Orthopaedic Association. The journal publishes the latest research and topical debates in all fields of clinical and experimental orthopaedics, including musculoskeletal medicine, sports medicine, locomotive syndrome, trauma, paediatrics, oncology and biomaterials, as well as basic research.