Title: Retrieval-augmented generation enhances large language model performance on the Japanese orthopedic board examination
Authors: Juntaro Maruyama, Satoshi Maki, Takeo Furuya, Yuki Nagashima, Kyota Kitagawa, Yasunori Toki, Shuhei Iwata, Megumi Yazaki, Takaki Kitamura, Sho Gushiken, Yuji Noguchi, Masataka Miura, Masahiro Inoue, Yasuhiro Shiga, Kazuhide Inage, Sumihisa Orita, Seiji Ohtori
Journal: Journal of Orthopaedic Science (official peer-reviewed journal of the Japanese Orthopaedic Association; JCR Q3, Orthopedics; Impact Factor 1.5)
DOI: 10.1016/j.jos.2025.03.003
Publication date: 2025-03-28
Publication type: Journal Article
Citations: 0
Abstract
Introduction: Large language models (LLMs) have shown potential in medical applications. However, their effectiveness in specialized medical domains remains underexplored. The integration of Retrieval-Augmented Generation (RAG) has been proposed to improve these models by reducing hallucinations and enhancing domain-specific information access. Through this evaluation, we aim to assess whether RAG can effectively bridge the gap between LLMs' current capabilities and the accuracy needed for medical use by examining GPT-3.5 Turbo, GPT-4o, and o1-preview on the 2024 Japanese Orthopedic Specialist Examination.
Methods: A specialized database was created using the "Standard Textbook of Orthopedics", and GPT-3.5 Turbo, GPT-4o, and o1-preview were evaluated with and without RAG. Models were tested on text-based and image-based questions exactly as presented in Japanese. An error analysis was conducted to identify key performance factors.
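The retrieval step the Methods describe — embedding textbook passages, finding the ones most relevant to an exam question, and prepending them to the model's prompt — can be sketched in a minimal, self-contained form. This is an illustrative toy only: it uses bag-of-words cosine similarity in place of the neural embeddings a real RAG system would use, and the sample passages, function names, and prompt template are all invented here, not taken from the paper or the "Standard Textbook of Orthopedics" database the authors built.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a production RAG system would call an
    # embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[tok] * b[tok] for tok in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    # Rank stored passages by similarity to the question, keep the top k.
    q = embed(question)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def build_prompt(question: str, corpus: list[str]) -> str:
    # Prepend retrieved context to the question before sending it to the LLM.
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical stand-ins for textbook passages (not from the paper).
corpus = [
    "Scaphoid fractures are the most common carpal fractures.",
    "Developmental dysplasia of the hip is screened in infancy.",
]
prompt = build_prompt("Which carpal bone is fractured most often?", corpus)
```

The error analysis in the Results maps onto this sketch directly: GPT-3.5 Turbo's failures correspond to ignoring the retrieved `Context` block, while GPT-4o and o1-preview failed mainly when `retrieve` had no relevant passage to return.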
Results: GPT-3.5 Turbo showed no substantial improvement with RAG: its overall accuracy was 28 % with RAG versus a 29 % baseline without it. GPT-4o rose from 62 % to 72 %, while o1-preview increased from 67 % to 84 %. Error analysis indicated that GPT-3.5 Turbo primarily failed to apply retrieved data, whereas GPT-4o and o1-preview made errors when the database lacked relevant information or when dealing with image-based questions.
Conclusions: The integration of RAG significantly boosted performance for GPT-4o and especially o1-preview. While both models surpassed the passing threshold, o1-preview demonstrated a level of proficiency relevant to clinical practice. However, RAG did not improve performance for GPT-3.5 Turbo, which lacks effective reasoning abilities.
Journal description:
The Journal of Orthopaedic Science is the official peer-reviewed journal of the Japanese Orthopaedic Association. The journal publishes the latest research and topical debates in all fields of clinical and experimental orthopaedics, including musculoskeletal medicine, sports medicine, locomotive syndrome, trauma, paediatrics, oncology and biomaterials, as well as basic research.