Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions.
Jason Bitterman, Alexander D'Angelo, Alexandra Holachek, James E Eubanks
PM&R (published 2025-05-02). DOI: 10.1002/pmrj.13386
Abstract
Background: There have been significant advances in machine learning and artificial intelligence technology over the past few years, leading to the release of large language models (LLMs) such as ChatGPT. There are many potential applications for LLMs in health care, but it is critical to first determine how accurate LLMs are before putting them into practice. No studies have evaluated the accuracy and precision of LLMs in responding to questions related to the field of physical medicine and rehabilitation (PM&R).
Objective: To determine the accuracy and precision of two OpenAI LLMs (GPT-3.5, released in November 2022, and GPT-4o, released in May 2024) in answering questions related to PM&R knowledge.
Design: Cross-sectional study. Both LLMs were tested on the same 744 PM&R knowledge questions covering all aspects of the field (general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, pain medicine, electrodiagnostic medicine, pediatric rehabilitation, prosthetics and orthotics, rheumatology, and pharmacology). Each LLM was tested three times on the same question set to assess precision (run-to-run consistency).
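The abstract does not specify how the models were prompted or how answers were scored. Purely as an illustration, an evaluation of this kind could be scripted along the following lines; the question format, prompt wording, letter-matching, and the score_run helper are assumptions for the sketch, not the authors' actual protocol.

```python
# Minimal sketch: score a multiple-choice question set against an OpenAI model.
# Question structure, prompt wording, and answer parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_run(questions, model="gpt-4o"):
    """questions: list of dicts with 'stem', 'choices' (letter -> text), and 'answer' keys (hypothetical format)."""
    correct = 0
    for q in questions:
        prompt = (
            q["stem"]
            + "\n"
            + "\n".join(f"{letter}. {text}" for letter, text in q["choices"].items())
            + "\nAnswer with the single letter of the best choice."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        reply = (response.choices[0].message.content or "").strip()
        if reply and reply[0].upper() == q["answer"].upper():
            correct += 1
    # Accuracy as percentage of correctly answered questions
    return 100.0 * correct / len(questions)


# Repeating the run three times gauges precision (run-to-run variability):
# accuracies = [score_run(question_set) for _ in range(3)]
```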
Setting: N/A.
Patients: N/A.
Interventions: N/A.
Main outcome measure: Percentage of correctly answered questions.
Results: For three runs of the 744-question set, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. For three runs of the same question set, GPT-4o answered 83.6%, 84.0%, and 84.1% of the questions correctly. GPT-4o outperformed GPT-3.5 in all subcategories of PM&R questions.
Conclusions: LLM technology is rapidly advancing, with the more recent GPT-4o model performing much better on PM&R knowledge questions than GPT-3.5. LLMs have potential to augment clinical practice, medical training, and patient education. However, the technology has limitations, and physicians should remain cautious about using it in practice at this time.
Journal description:
Topics covered include acute and chronic musculoskeletal disorders and pain, neurologic conditions involving the central and peripheral nervous systems, rehabilitation of impairments associated with disabilities in adults and children, and neurophysiology and electrodiagnosis. PM&R emphasizes principles of injury, function, and rehabilitation, and is designed to be relevant to practitioners and researchers in a variety of medical and surgical specialties and rehabilitation disciplines including allied health.