Prut Saowaprut, Romen Samuel Wabina, Junwei Yang, Lertboon Siriwat
Performance of large language models on Thailand's national medical licensing examination: a cross-sectional study.
Purpose: This study aimed to evaluate the feasibility of general-purpose large language models (LLMs) in addressing inequities in medical licensure exam preparation for Thailand's National Medical Licensing Examination (ThaiNLE), which currently lacks standardized public study materials.
Methods: We assessed 4 multi-modal LLMs (GPT-4, Claude 3 Opus, Gemini 1.0/1.5 Pro) using a 304-question ThaiNLE Step 1 mock examination (10.2% image-based), applying deterministic API configurations and 5 inference repetitions per model. Performance was measured via micro- and macro-accuracy metrics compared against historical passing thresholds.
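The micro- and macro-accuracy metrics used in the Methods can be sketched as follows. This is a minimal illustration of the standard definitions, not the authors' actual evaluation code; the function name and toy data are hypothetical. Micro-accuracy pools all questions, while macro-accuracy averages per-domain accuracies so that small domains carry equal weight.

```python
from collections import defaultdict

def micro_macro_accuracy(results):
    """Compute (micro, macro) accuracy from (domain, is_correct) pairs.

    Micro: fraction correct over all questions pooled together.
    Macro: unweighted mean of each domain's accuracy.
    """
    # Micro-accuracy: total correct over total answered.
    micro = sum(ok for _, ok in results) / len(results)

    # Group correctness flags by medical domain.
    per_domain = defaultdict(list)
    for domain, ok in results:
        per_domain[domain].append(ok)

    # Macro-accuracy: average the per-domain accuracies.
    macro = sum(sum(v) / len(v) for v in per_domain.values()) / len(per_domain)
    return micro, macro

# Toy example with two domains of unequal size:
results = [("cardio", 1), ("cardio", 0), ("cardio", 1), ("genetics", 1)]
micro, macro = micro_macro_accuracy(results)
# micro = 3/4; macro = mean(2/3, 1/1) = 5/6
```

The "deterministic API configurations" in the Methods would typically correspond to sampling temperature 0 (or an equivalent greedy-decoding setting), with the 5 inference repetitions per model serving to confirm stability of the outputs.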
Results: All models exceeded passing scores, with GPT-4 achieving the highest accuracy (88.9%; 95% confidence interval, 88.7-89.1), surpassing Thailand's national average by more than 2 standard deviations. Claude 3.5 Sonnet (80.1%) and Gemini 1.5 Pro (72.8%) followed hierarchically. Models demonstrated robustness across 17 of 20 medical domains, but variability was noted in genetics (74.0%) and cardiovascular topics (58.3%). While models demonstrated proficiency with images (Gemini 1.0 Pro: +9.9% vs. text), text-only accuracy remained superior (GPT-4o: 90.0% vs. 82.6%).
Conclusion: General-purpose LLMs show promise as equitable preparatory tools for ThaiNLE Step 1. However, domain-specific knowledge gaps and inconsistent multi-modal integration warrant refinement before clinical deployment.
About the journal:
The Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art, practical information on educational evaluation for the health professions, in order to improve the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the application of measurement theory to medical and health education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of nationwide and international education programs, computer-based testing, computerized adaptive testing, and medical and health regulatory bodies. Its scope covers a variety of professions concerned with public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, Oriental medical doctors, Oriental medicine dispensers, Oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.