Performance of large language models on family medicine licensing exams
Mahmud Omar, Kareem Hijazi, Mohammad Omar, Girish N Nadkarni, Eyal Klang
Family Practice, 42(4), published 2025-06-04. DOI: 10.1093/fampra/cmaf035. JCR Q1 (Medicine, General & Internal), impact factor 2.4.
Cited by: 0
Abstract
Background and aim: Large language models (LLMs) have shown promise in specialized medical exams but remain less explored in family medicine and primary care. This study evaluated eight state-of-the-art LLMs on the official Israeli primary care licensing exam, focusing on prompt design and explanation quality.
Methods: Two hundred multiple-choice questions were tested using simple and few-shot Chain-of-Thought prompts (prompts that include worked examples illustrating the reasoning). Performance differences were assessed with Cochran's Q and pairwise McNemar tests. A stress test of the top performer (OpenAI's o1-preview) examined 30 selected questions, with two physicians scoring explanations for accuracy, logic, and hallucinations (extra or fabricated information not supported by the question).
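The comparison design described above can be sketched in Python. This is a minimal illustration with synthetic data, not the study's results: Cochran's Q tests whether several models differ in accuracy on the same question set, and a pairwise McNemar test compares two specific models on their discordant answers. The model count, accuracies, and seed are assumptions for the example.

```python
# Illustrative sketch with synthetic data: omnibus Cochran's Q across
# models answering the same questions, then one pairwise McNemar test.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
n_questions = 200
# Binary correctness matrix: rows = questions, columns = 3 hypothetical models
correct = rng.binomial(1, [0.70, 0.75, 0.85], size=(n_questions, 3))

# Cochran's Q: do the models' accuracies differ across the shared questions?
q_res = cochrans_q(correct)
print(f"Cochran's Q = {q_res.statistic:.2f}, p = {q_res.pvalue:.4f}")

# Pairwise McNemar between model 0 and model 2: 2x2 table of
# (model0 correct?, model2 correct?) counts over the same questions.
a, b = correct[:, 0], correct[:, 2]
table = [[int(np.sum((a == 1) & (b == 1))), int(np.sum((a == 1) & (b == 0)))],
         [int(np.sum((a == 0) & (b == 1))), int(np.sum((a == 0) & (b == 0)))]]
m_res = mcnemar(table, exact=True)
print(f"McNemar p = {m_res.pvalue:.4f}")
```

McNemar is the appropriate pairwise follow-up here because the same questions are answered by every model, so only the discordant pairs (one model right, the other wrong) carry information about a difference.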
Results: Five models exceeded the 65% passing threshold under simple prompts; seven did so with few-shot prompts. o1-preview reached 85.5%. In the stress test, explanations were generally coherent and accurate, with 5 of 120 flagged for hallucinations. Inter-rater agreement on explanation scoring was high (weighted kappa 0.773; Intraclass Correlation Coefficient (ICC) 0.776).
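The inter-rater agreement statistic reported above can be reproduced in outline. This is a hedged sketch with hypothetical ratings, not the study's data; the 1-5 scale and the linear weighting scheme are assumptions, since the abstract does not state which weighting the authors used.

```python
# Illustrative sketch (synthetic ratings): weighted Cohen's kappa between
# two physicians' explanation-quality scores on 30 questions.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 scores from two raters; mostly agreeing, a few off by one
rater_a = [5, 4, 5, 3, 4, 5, 5, 2, 4, 5, 3, 4, 5, 5, 4,
           4, 5, 3, 4, 5, 5, 4, 4, 3, 5, 4, 5, 5, 4, 4]
rater_b = [5, 4, 4, 3, 4, 5, 5, 3, 4, 5, 3, 4, 5, 4, 4,
           4, 5, 3, 5, 5, 5, 4, 4, 3, 5, 4, 5, 5, 4, 5]

# weights="linear" penalizes disagreements in proportion to their distance
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Weighted kappa = {kappa:.3f}")
```

A weighted kappa is preferred over the unweighted version for ordinal scores like these, because a 4-vs-5 disagreement should count less against agreement than a 2-vs-5 disagreement.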
Conclusions: Most tested models performed well on an official family medicine exam, especially with structured prompts. Nonetheless, multiple-choice formats cannot address broader clinical competencies such as physical exams and patient rapport. Future efforts should refine these models to eliminate hallucinations, test for socio-demographic biases, and ensure alignment with real-world demands.
Journal description
Family Practice is an international journal aimed at practitioners, teachers, and researchers in the fields of family medicine, general practice, and primary care in both developed and developing countries.
Family Practice offers its readership an international view of the problems and preoccupations in the field, while providing a medium of instruction and exploration.
The journal's range and content covers such areas as health care delivery, epidemiology, public health, and clinical case studies. The journal aims to be interdisciplinary and contributions from other disciplines of medicine and social science are always welcomed.