Performance of large language models on family medicine licensing exams
Mahmud Omar, Kareem Hijazi, Mohammad Omar, Girish N Nadkarni, Eyal Klang
Family Practice, 42(4), published 2025-06-04. DOI: 10.1093/fampra/cmaf035. JCR Q1 (Medicine, General & Internal), impact factor 2.4.
Cited by: 0
Abstract
Background and aim: Large language models (LLMs) have shown promise in specialized medical exams but remain less explored in family medicine and primary care. This study evaluated eight state-of-the-art LLMs on the official Israeli primary care licensing exam, focusing on prompt design and explanation quality.
Methods: Two hundred multiple-choice questions were tested using simple and few-shot Chain-of-Thought prompts (prompts that include worked examples illustrating the reasoning). Performance differences were assessed with Cochran's Q and pairwise McNemar tests. A stress test of the top performer (OpenAI's o1-preview) examined 30 selected questions, with two physicians scoring explanations for accuracy, logic, and hallucinations (extra or fabricated information not supported by the question).
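The comparison design described above can be sketched in Python. This is a minimal illustration with synthetic data, not the study's results: Cochran's Q tests whether several models differ in accuracy on the same question set, and a pairwise McNemar test compares two specific models on their discordant answers. The model count, accuracies, and seed are assumptions for the example.

```python
# Illustrative sketch with synthetic data: omnibus Cochran's Q across
# models answering the same questions, then one pairwise McNemar test.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
n_questions = 200
# Binary correctness matrix: rows = questions, columns = 3 hypothetical models
correct = rng.binomial(1, [0.70, 0.75, 0.85], size=(n_questions, 3))

# Cochran's Q: do the models' accuracies differ across the shared questions?
q_res = cochrans_q(correct)
print(f"Cochran's Q = {q_res.statistic:.2f}, p = {q_res.pvalue:.4f}")

# Pairwise McNemar between model 0 and model 2: 2x2 table of
# (model0 correct?, model2 correct?) counts over the same questions.
a, b = correct[:, 0], correct[:, 2]
table = [[int(np.sum((a == 1) & (b == 1))), int(np.sum((a == 1) & (b == 0)))],
         [int(np.sum((a == 0) & (b == 1))), int(np.sum((a == 0) & (b == 0)))]]
m_res = mcnemar(table, exact=True)
print(f"McNemar p = {m_res.pvalue:.4f}")
```

McNemar is the appropriate pairwise follow-up here because the same questions are answered by every model, so only the discordant pairs (one model right, the other wrong) carry information about a difference.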
Results: Five models exceeded the 65% passing threshold under simple prompts; seven did so with few-shot prompts. o1-preview reached 85.5%. In the stress test, explanations were generally coherent and accurate, with 5 of 120 flagged for hallucinations. Inter-rater agreement on explanation scoring was high (weighted kappa 0.773; Intraclass Correlation Coefficient (ICC) 0.776).
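The inter-rater agreement statistic reported above can be reproduced in outline. This is a hedged sketch with hypothetical ratings, not the study's data; the 1-5 scale and the linear weighting scheme are assumptions, since the abstract does not state which weighting the authors used.

```python
# Illustrative sketch (synthetic ratings): weighted Cohen's kappa between
# two physicians' explanation-quality scores on 30 questions.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 scores from two raters; mostly agreeing, a few off by one
rater_a = [5, 4, 5, 3, 4, 5, 5, 2, 4, 5, 3, 4, 5, 5, 4,
           4, 5, 3, 4, 5, 5, 4, 4, 3, 5, 4, 5, 5, 4, 4]
rater_b = [5, 4, 4, 3, 4, 5, 5, 3, 4, 5, 3, 4, 5, 4, 4,
           4, 5, 3, 5, 5, 5, 4, 4, 3, 5, 4, 5, 5, 4, 5]

# weights="linear" penalizes disagreements in proportion to their distance
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Weighted kappa = {kappa:.3f}")
```

A weighted kappa is preferred over the unweighted version for ordinal scores like these, because a 4-vs-5 disagreement should count less against agreement than a 2-vs-5 disagreement.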
Conclusions: Most tested models performed well on an official family medicine exam, especially with structured prompts. Nonetheless, multiple-choice formats cannot address broader clinical competencies such as physical exams and patient rapport. Future efforts should refine these models to eliminate hallucinations, test for socio-demographic biases, and ensure alignment with real-world demands.
Journal description
Family Practice is an international journal aimed at practitioners, teachers, and researchers in the fields of family medicine, general practice, and primary care in both developed and developing countries.
Family Practice offers its readership an international view of the problems and preoccupations in the field, while providing a medium of instruction and exploration.
The journal's range and content covers such areas as health care delivery, epidemiology, public health, and clinical case studies. The journal aims to be interdisciplinary and contributions from other disciplines of medicine and social science are always welcomed.