Retrieval-augmented generation enhances large language model performance on the Japanese orthopedic board examination.

IF 1.5 | Zone 4, Medicine | JCR Q3, ORTHOPEDICS
Juntaro Maruyama, Satoshi Maki, Takeo Furuya, Yuki Nagashima, Kyota Kitagawa, Yasunori Toki, Shuhei Iwata, Megumi Yazaki, Takaki Kitamura, Sho Gushiken, Yuji Noguchi, Masataka Miura, Masahiro Inoue, Yasuhiro Shiga, Kazuhide Inage, Sumihisa Orita, Seiji Ohtori
Journal of Orthopaedic Science. Published 2025-03-28. Publication type: Journal Article. DOI: 10.1016/j.jos.2025.03.003 (https://doi.org/10.1016/j.jos.2025.03.003)
Citations: 0

Abstract

Introduction: Large language models (LLMs) have shown potential in medical applications. However, their effectiveness in specialized medical domains remains underexplored. Retrieval-Augmented Generation (RAG) has been proposed to improve these models by reducing hallucinations and enhancing access to domain-specific information. This study assesses whether RAG can bridge the gap between LLMs' current capabilities and the accuracy needed for medical use by evaluating GPT-3.5 Turbo, GPT-4o, and o1-preview on the 2024 Japanese Orthopedic Specialist Examination.

Methods: A specialized database was created using the "Standard Textbook of Orthopedics", and GPT-3.5 Turbo, GPT-4o, and o1-preview were evaluated with and without RAG. Models were tested on text-based and image-based questions exactly as presented in Japanese. An error analysis was conducted to identify key performance factors.
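The with-and-without-RAG comparison described in the Methods can be illustrated with a minimal sketch. The abstract does not state how the textbook was chunked, how passages were retrieved, or how prompts were worded, so everything here is an assumption: the bag-of-words cosine ranking, the placeholder English passages, and the hypothetical helpers `retrieve` and `build_prompt`. It shows only the general shape of the two conditions, not the authors' pipeline.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens. Assumption: real Japanese text would need a
    # proper tokenizer or an embedding model instead of this regex.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, passages, k=1):
    # Hypothetical retriever: rank passages by similarity to the question.
    q = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]

def build_prompt(question, passages=None):
    # Without passages this is the baseline condition; with passages, the RAG condition.
    if not passages:
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"- {p}" for p in passages)
    return f"Reference passages:\n{context}\n\nQuestion: {question}\nAnswer:"

# Placeholder passages standing in for textbook chunks (not real textbook content).
passages = [
    "Placeholder passage about fracture classification.",
    "Placeholder passage about spinal cord injury management.",
]
question = "Which option describes fracture classification?"

top = retrieve(question, passages, k=1)
prompt_with_rag = build_prompt(question, top)
prompt_without_rag = build_prompt(question)
```

Each exam question would be sent to a model once with `prompt_without_rag` and once with `prompt_with_rag`, and the two sets of answers scored separately.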

Results: GPT-3.5 Turbo showed no substantial improvement with RAG: its overall accuracy was 28% with RAG, compared with a baseline of 29% without RAG. GPT-4o rose from 62% to 72%, while o1-preview increased from 67% to 84%. Error analysis indicated that GPT-3.5 Turbo primarily failed to apply retrieved data, whereas GPT-4o and o1-preview made errors when the database lacked relevant information or when dealing with image-based questions.
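The reported accuracies can be turned into percentage-point changes with a few lines of arithmetic. The figures below are the ones given in the Results; the dictionary layout and the helper name `rag_delta` are illustrative assumptions.

```python
# Overall accuracy in percent, as (without RAG, with RAG), taken from the Results.
reported = {
    "GPT-3.5 Turbo": (29, 28),
    "GPT-4o": (62, 72),
    "o1-preview": (67, 84),
}

def rag_delta(without_rag, with_rag):
    # Change in accuracy attributable to adding RAG, in percentage points.
    return with_rag - without_rag

deltas = {model: rag_delta(*pair) for model, pair in reported.items()}
for model, delta in deltas.items():
    print(f"{model}: {delta:+d} percentage points")
```

This gives -1 point for GPT-3.5 Turbo, +10 for GPT-4o, and +17 for o1-preview.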

Conclusions: The integration of RAG significantly boosted performance for GPT-4o and especially o1-preview. While both models surpassed the passing threshold, o1-preview demonstrated a level of proficiency relevant to clinical practice. However, RAG did not improve the performance of GPT-3.5 Turbo because it lacks effective reasoning abilities.

Source journal
Journal of Orthopaedic Science (Medicine: Orthopedics)
CiteScore: 3.00
Self-citation rate: 0.00%
Articles published: 290
Review time: 90 days
About the journal: The Journal of Orthopaedic Science is the official peer-reviewed journal of the Japanese Orthopaedic Association. The journal publishes the latest research and topical debates in all fields of clinical and experimental orthopaedics, including musculoskeletal medicine, sports medicine, locomotive syndrome, trauma, paediatrics, oncology and biomaterials, as well as basic research.