Assessing ChatGPT-4's performance on the US prosthodontic exam: impact of fine-tuning and contextual prompting vs. base knowledge, a cross-sectional study

IF 2.7 · Medicine (CAS Tier 2) · Q1 EDUCATION & EDUCATIONAL RESEARCH
Mahmood Dashti, Farshad Khosraviani, Tara Azimi, Delband Hefzi, Shohreh Ghasemi, Amir Fahimipour, Niusha Zare, Zohaib Khurshid, Syed Rashid Habib
{"title":"评估ChatGPT-4在美国修复检查中的表现:微调和上下文提示与基础知识的影响,一项横断面研究。","authors":"Mahmood Dashti, Farshad Khosraviani, Tara Azimi, Delband Hefzi, Shohreh Ghasemi, Amir Fahimipour, Niusha Zare, Zohaib Khurshid, Syed Rashid Habib","doi":"10.1186/s12909-025-07371-9","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI), such as ChatGPT-4 from OpenAI, has the potential to transform medical education and assessment. However, its effectiveness in specialized fields like prosthodontics, especially when comparing base to fine-tuned models, remains underexplored. This study evaluates the performance of ChatGPT-4 on the US National Prosthodontic Resident Mock Exam in its base form and after fine-tuning. The aim is to determine whether fine-tuning improves the AI's accuracy in answering specialized questions.</p><p><strong>Methods: </strong>An official sample questions from the 2021 US National Prosthodontic Resident Mock Exam was used, obtained from the American College of Prosthodontists. A total of 150 questions were initially considered, and resources were available for 106 questions. Both the base and fine-tuned models of ChatGPT-4 were tested under simulated exam conditions. Performance was assessed by comparing correct and incorrect responses. The Chi-square test was used to analyze accuracy, with significance set at p < 0.05. The Kappa coefficient was calculated to measure agreement between the models' responses.</p><p><strong>Results: </strong>The base model of ChatGPT-4 correctly answered 62.7% of the 150 questions. For the 106 questions with resources, the fine-tuned model answered 73.6% correctly. The Chi-square test showed a significant improvement in performance after fine-tuning (p < 0.001). The Kappa coefficient was 0.39, indicating moderate agreement between the models (p < 0.001). Performance varied by topic, with lower accuracy in areas such as Implant Prosthodontics, Removable Prosthodontics, and Occlusion, though the fine-tuned model consistently outperformed the base model.</p><p><strong>Conclusions: </strong>Fine-tuning ChatGPT-4 with specific resources significantly enhances its accuracy in answering specialized prosthodontic exam questions. While the base model provides a solid baseline, fine-tuning is essential for improving AI performance in specialized fields. However, certain topics may require more targeted training to achieve higher accuracy.</p>","PeriodicalId":51234,"journal":{"name":"BMC Medical Education","volume":"25 1","pages":"761"},"PeriodicalIF":2.7000,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12102979/pdf/","citationCount":"0","resultStr":"{\"title\":\"Assessing ChatGPT-4's performance on the US prosthodontic exam: impact of fine-tuning and contextual prompting vs. base knowledge, a cross-sectional study.\",\"authors\":\"Mahmood Dashti, Farshad Khosraviani, Tara Azimi, Delband Hefzi, Shohreh Ghasemi, Amir Fahimipour, Niusha Zare, Zohaib Khurshid, Syed Rashid Habib\",\"doi\":\"10.1186/s12909-025-07371-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Artificial intelligence (AI), such as ChatGPT-4 from OpenAI, has the potential to transform medical education and assessment. However, its effectiveness in specialized fields like prosthodontics, especially when comparing base to fine-tuned models, remains underexplored. 
This study evaluates the performance of ChatGPT-4 on the US National Prosthodontic Resident Mock Exam in its base form and after fine-tuning. The aim is to determine whether fine-tuning improves the AI's accuracy in answering specialized questions.</p><p><strong>Methods: </strong>An official sample questions from the 2021 US National Prosthodontic Resident Mock Exam was used, obtained from the American College of Prosthodontists. A total of 150 questions were initially considered, and resources were available for 106 questions. Both the base and fine-tuned models of ChatGPT-4 were tested under simulated exam conditions. Performance was assessed by comparing correct and incorrect responses. The Chi-square test was used to analyze accuracy, with significance set at p < 0.05. The Kappa coefficient was calculated to measure agreement between the models' responses.</p><p><strong>Results: </strong>The base model of ChatGPT-4 correctly answered 62.7% of the 150 questions. For the 106 questions with resources, the fine-tuned model answered 73.6% correctly. The Chi-square test showed a significant improvement in performance after fine-tuning (p < 0.001). The Kappa coefficient was 0.39, indicating moderate agreement between the models (p < 0.001). Performance varied by topic, with lower accuracy in areas such as Implant Prosthodontics, Removable Prosthodontics, and Occlusion, though the fine-tuned model consistently outperformed the base model.</p><p><strong>Conclusions: </strong>Fine-tuning ChatGPT-4 with specific resources significantly enhances its accuracy in answering specialized prosthodontic exam questions. While the base model provides a solid baseline, fine-tuning is essential for improving AI performance in specialized fields. However, certain topics may require more targeted training to achieve higher accuracy.</p>\",\"PeriodicalId\":51234,\"journal\":{\"name\":\"BMC Medical Education\",\"volume\":\"25 1\",\"pages\":\"761\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-05-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12102979/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Education\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12909-025-07371-9\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Education","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12909-025-07371-9","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract


Background: Artificial intelligence (AI) models such as OpenAI's ChatGPT-4 have the potential to transform medical education and assessment. However, their effectiveness in specialized fields like prosthodontics, especially when comparing base and fine-tuned models, remains underexplored. This study evaluates the performance of ChatGPT-4 on the US National Prosthodontic Resident Mock Exam in its base form and after fine-tuning. The aim is to determine whether fine-tuning improves the AI's accuracy in answering specialized questions.
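The abstract does not describe the fine-tuning pipeline itself; for orientation, the following is a minimal sketch of how exam-style Q&A fine-tuning is typically done through the OpenAI API. The file name, system prompt, and model snapshot are illustrative assumptions, not details taken from the study:

```python
# Minimal sketch of fine-tuning a chat model on exam-style Q&A through the
# OpenAI API. Illustrative only: the study does not publish its pipeline,
# and the file name, system prompt, and model snapshot below are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One hypothetical training example per JSONL line: a question paired with
# its reference answer, in the chat format the fine-tuning endpoint expects.
example = {
    "messages": [
        {"role": "system", "content": "You are a prosthodontics exam assistant."},
        {"role": "user", "content": "Sample multiple-choice question text..."},
        {"role": "assistant", "content": "Correct answer with brief rationale."},
    ]
}
with open("prostho_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

# Upload the training file and start a fine-tuning job on a tunable snapshot.
training_file = client.files.create(
    file=open("prostho_train.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed snapshot; GPT-4 tuning access is limited
)
print(job.id)
```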

Methods: An official set of sample questions from the 2021 US National Prosthodontic Resident Mock Exam, obtained from the American College of Prosthodontists, was used. Of the 150 questions initially considered, supporting resources were available for 106. Both the base and fine-tuned models of ChatGPT-4 were tested under simulated exam conditions. Performance was assessed by comparing correct and incorrect responses. The Chi-square test was used to analyze accuracy, with significance set at p < 0.05. The Kappa coefficient was calculated to measure agreement between the models' responses.
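A minimal sketch of this analysis, assuming each model's per-question outcomes are coded as 0/1 vectors (the vectors below are random placeholders; the study's item-level data are not public):

```python
# Sketch of the described analysis: a chi-square test on correct/incorrect
# counts and Cohen's kappa on paired per-question outcomes. The 0/1 vectors
# here are random placeholders standing in for the study's item-level data.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=106)   # 1 = correct, 0 = incorrect (base model)
tuned = rng.integers(0, 2, size=106)  # same questions, fine-tuned model

# 2x2 contingency table: rows = model, columns = (correct, incorrect).
table = np.array([
    [base.sum(), len(base) - base.sum()],
    [tuned.sum(), len(tuned) - tuned.sum()],
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# Agreement between the two models' per-question outcomes.
print(f"Cohen's kappa = {cohen_kappa_score(base, tuned):.2f}")
```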

Results: The base model of ChatGPT-4 correctly answered 62.7% of the 150 questions. For the 106 questions with resources, the fine-tuned model answered 73.6% correctly. The Chi-square test showed a significant improvement in performance after fine-tuning (p < 0.001). The Kappa coefficient was 0.39, indicating moderate agreement between the models (p < 0.001). Performance varied by topic, with lower accuracy in areas such as Implant Prosthodontics, Removable Prosthodontics, and Occlusion, though the fine-tuned model consistently outperformed the base model.
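For concreteness, the reported percentages back-solve to approximately the following raw counts; this is an inference from the rounded proportions, not figures stated in the abstract:

```python
# Back-solve approximate correct-answer counts from the reported percentages
# (the abstract reports proportions only; the exact counts are an inference).
base_correct = round(0.627 * 150)   # ~94 of 150 questions correct (base model)
tuned_correct = round(0.736 * 106)  # ~78 of 106 questions correct (fine-tuned)
print(base_correct, tuned_correct)  # 94 78
```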

Conclusions: Fine-tuning ChatGPT-4 with specific resources significantly enhances its accuracy in answering specialized prosthodontic exam questions. While the base model provides a solid baseline, fine-tuning is essential for improving AI performance in specialized fields. However, certain topics may require more targeted training to achieve higher accuracy.

Source journal
BMC Medical Education
CiteScore: 4.90
Self-citation rate: 11.10%
Articles per year: 795
Review turnaround: 6 months
Journal description: BMC Medical Education is an open access journal publishing original peer-reviewed research articles in relation to the training of healthcare professionals, including undergraduate, postgraduate, and continuing education. The journal has a special focus on curriculum development, evaluations of performance, assessment of training needs and evidence-based medicine.