{"title":"Comparative Analysis of ChatGPT-3.5 and GPT-4 in Open-Ended Clinical Reasoning Across Dental Specialties.","authors":"Yasamin Babaee Hemmati, Morteza Rasouli, Mehran Falahchai","doi":"10.1111/eje.13144","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>The integration of large language models (LLMs) such as ChatGPT into health care has garnered increasing interest. While previous studies have assessed these models using structured multiple-choice questions, limited research has evaluated their performance on open-ended, scenario-based clinical tasks, particularly in dentistry. This study aimed to evaluate and compare the clinical reasoning capabilities of ChatGPT-3.5 and GPT-4 in formulating treatment plans across seven dental specialties using realistic, open-ended clinical scenarios.</p><p><strong>Methods: </strong>A cross-sectional analytical study, reported in accordance with the STROBE guidelines, was conducted using 70 dental cases spanning endodontics, oral and maxillofacial surgery, oral medicine, orthodontics, paediatric dentistry, periodontology, and radiology. Each case was submitted to both ChatGPT-3.5 and GPT-4 (paid version, November 2024). Responses were evaluated by specialty-specific expert panels using a three-level rubric (poor, average, good). Statistical analyses included chi-square tests and Fisher-Freeman-Halton exact tests (α = 0.05).</p><p><strong>Results: </strong>GPT-4 significantly outperformed GPT-3.5 in overall response quality (67.1% vs. 44.3% rated as 'good'; p = 0.016). Although no significant differences were observed across most specialties, GPT-4 showed a statistically superior performance in oral and maxillofacial surgery. Its advantage was more pronounced in complex cases, aligning with the model's enhanced contextual reasoning.</p><p><strong>Conclusion: </strong>GPT-4 demonstrated superior accuracy and consistency compared to GPT-3.5, particularly in clinically complex and integrative tasks. These findings support the potential of advanced LLMs as adjunct tools in dental education and decision-making, though specialty-specific applications and expert oversight remain essential.</p>","PeriodicalId":50488,"journal":{"name":"European Journal of Dental Education","volume":" ","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Dental Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/eje.13144","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: The integration of large language models (LLMs) such as ChatGPT into health care has garnered increasing interest. While previous studies have assessed these models using structured multiple-choice questions, limited research has evaluated their performance on open-ended, scenario-based clinical tasks, particularly in dentistry. This study aimed to evaluate and compare the clinical reasoning capabilities of ChatGPT-3.5 and GPT-4 in formulating treatment plans across seven dental specialties using realistic, open-ended clinical scenarios.
Methods: A cross-sectional analytical study, reported in accordance with the STROBE guidelines, was conducted using 70 dental cases spanning endodontics, oral and maxillofacial surgery, oral medicine, orthodontics, paediatric dentistry, periodontology, and radiology. Each case was submitted to both ChatGPT-3.5 and GPT-4 (paid version, November 2024). Responses were evaluated by specialty-specific expert panels using a three-level rubric (poor, average, good). Statistical analyses included chi-square tests and Fisher-Freeman-Halton exact tests (α = 0.05).
Results: GPT-4 significantly outperformed GPT-3.5 in overall response quality (67.1% vs. 44.3% of responses rated 'good'; p = 0.016). Although no significant differences were observed in most individual specialties, GPT-4 showed statistically superior performance in oral and maxillofacial surgery. Its advantage was more pronounced in complex cases, consistent with the model's enhanced contextual reasoning.
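To make the statistical comparison concrete, here is a minimal Python sketch of the kind of contingency-table analysis the Methods describe. The 'good' counts (47 and 31) follow arithmetically from the reported percentages (67.1% and 44.3% of 70 cases); the poor/average split shown is hypothetical, as the abstract does not report it, so the printed p-values illustrate the machinery rather than reproduce the paper's panel-level results. SciPy's fisher_exact handles only 2x2 tables, so the full 2x3 rubric table is tested with a chi-square here; the Fisher-Freeman-Halton test used in the paper generalises Fisher's exact test to r x c tables.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: GPT-4, GPT-3.5; columns: poor, average, good.
# 'good' counts derive from the reported percentages (67.1% and 44.3% of 70);
# the poor/average split is an assumed, illustrative breakdown.
ratings = [
    [5, 18, 47],   # GPT-4: 47/70 = 67.1% rated 'good'
    [12, 27, 31],  # GPT-3.5: 31/70 = 44.3% rated 'good'
]

chi2, p, dof, expected = chi2_contingency(ratings)
print(f"2x3 chi-square: chi2={chi2:.2f}, dof={dof}, p={p:.3f}")

# Collapsing the rubric to 'good' vs. 'not good' yields a 2x2 table,
# which scipy's Fisher exact test can evaluate directly.
good_vs_not = [[47, 23], [31, 39]]
odds_ratio, p_exact = fisher_exact(good_vs_not)
print(f"2x2 Fisher exact: OR={odds_ratio:.2f}, p={p_exact:.4f}")
```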
Conclusion: GPT-4 demonstrated superior accuracy and consistency compared to GPT-3.5, particularly in clinically complex and integrative tasks. These findings support the potential of advanced LLMs as adjunct tools in dental education and decision-making, though specialty-specific applications and expert oversight remain essential.
Journal Introduction:
The aim of the European Journal of Dental Education is to publish original topical and review articles of the highest quality in the field of Dental Education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.