Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination

IF 1.9; Zone 4 (Education); Q3, DENTISTRY, ORAL SURGERY & MEDICINE

European Journal of Dental Education Pub Date : 2025-01-31 DOI:10.1111/eje.13073

Yu-Tao Xiong, Zheng-Zhe Zhan, Cheng-Lan Zhong, Wei Zeng, Ji-Xiang Guo, Wei Tang, Chang Liu

Volume 29, Issue 2, Pages 332-340

Citations: 0

Abstract

Background

This study aimed to simulate the different ways in which students might use LLMs to prepare for the Chinese Dental Licensing Examination (CDLE), and to evaluate in detail how the models perform in this medical education setting.

Methods

A stratified random sampling strategy was implemented to select and subsequently revise 200 questions from the CDLE. Seven LLMs, recognised for their exceptional performance in the Chinese domain, were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining questions and adversarial testing. The evaluation metrics included accuracy, agreement rate and teaching effectiveness score. Wald χ² tests and Kruskal–Wallis tests were employed to determine whether the differences among the LLMs across various scenarios and before and after adversarial testing were statistically significant.
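The accuracy metric named above can be illustrated with a minimal sketch. It assumes each question has a single best option and that an answer counts as correct only when it matches the reference key exactly; the abstract does not spell out its scoring rule, so this is an assumption. The function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of questions where the chosen option equals the reference key."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one question is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical toy data: five multiple-choice questions (not from the study).
reference = ["A", "C", "B", "D", "E"]
model_answers = ["A", "C", "B", "A", "E"]

print(f"accuracy = {accuracy(model_answers, reference):.0%}")  # accuracy = 80%
```

Under the same assumption, an agreement rate could be computed with the same function by passing the option that each explanation ultimately endorses as `predicted`.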

Results

The majority of the tested LLMs met the passing threshold on the CDLE benchmark, with Doubao-pro 32k and Qwen2-72b achieving the highest accuracy (both 81%). Doubao-pro 32k achieved the highest agreement rate with the reference answers (98%) when providing explanations. Although the LLMs differed significantly in their Likert-scale teaching effectiveness scores, all of them were able to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 showed the smallest reduction in agreement rate (14.6%, p = 0.001).
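To make the comparison of Likert-scale scores across models concrete, the following is a minimal sketch of the Kruskal-Wallis H statistic with the standard tie correction, written with the Python standard library only. It is not the authors' analysis code; the function name and the scores are hypothetical, and a full test would also compare H with a χ² distribution with (number of groups − 1) degrees of freedom to obtain a p-value.

```python
from collections import Counter
from typing import Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Return the tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)

    # Assign each distinct value the mean of the rank positions it occupies.
    counts = Counter(pooled)
    rank_of = {}
    position = 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t + 1) / 2  # mean of position+1 .. position+t
        position += t

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_sum = sum(t ** 3 - t for t in counts.values())
    correction = 1 - tie_sum / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Hypothetical 5-point Likert ratings for three models (not from the study).
model_a = [5, 4, 5, 4, 5]
model_b = [3, 4, 3, 2, 3]
model_c = [4, 4, 5, 3, 4]

print(f"H = {kruskal_wallis_h([model_a, model_b, model_c]):.3f}")
```

A rank-based test suits Likert ratings because they are ordinal: the statistic depends only on the ordering of the scores, not on the spacing between scale points.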

Conclusions

LLMs trained on Chinese corpora, such as Doubao-pro 32k, scored higher than GPT-4 in answering and explaining questions, although the differences were not statistically significant. During adversarial testing, however, all models exhibited diminished performance, with GPT-4 displaying comparatively greater robustness. Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations when LLMs are used in medical education.


Source journal

European Journal of Dental Education (Medicine: Education, Scientific Disciplines)

CiteScore

4.10

Self-citation rate

16.70%

Articles published

127

Review time

6-12 weeks

Journal description: The aim of the European Journal of Dental Education is to publish original topical and review articles of the highest quality in the field of Dental Education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.