Performance of Large Language Models in Nursing Examinations: Comparative Analysis of ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark in China
Peifang Li, Menglin Jiang, Jiali Chen, Ning Ning
Nursing Open, 12(10): e70317 (2025). DOI: 10.1002/nop2.70317
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12491847/pdf/
Background: While large language models (LLMs) have been widely utilised in nursing education, their performance in Chinese nursing examinations remains unexplored, particularly in the context of ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark.
Purpose: This study assessed the performance of ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark on the 2022 China National Nursing Professional Qualification Exam (CNNPQE) at both the Junior and Intermediate levels. It also investigated whether the accuracy of these language models' responses correlated with the exam's difficulty or subject matter.
Methods: We entered 800 questions from the 2022 CNNPQE-Junior and CNNPQE-Intermediate exams into ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark and recorded each model's rate of correct answers. We then analysed the association between these accuracy rates and the exams' difficulty levels and subjects.
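The scoring step described above — comparing each model's answers against the official key and computing an accuracy rate — can be sketched as follows. This is a hypothetical illustration, not the authors' actual evaluation harness; the `accuracy` helper and the toy question IDs are assumptions.

```python
# Hypothetical sketch of the scoring step: each model's answers are compared
# against the official answer key and an accuracy rate is computed.

def accuracy(model_answers, answer_key):
    """Fraction of questions answered correctly (missing answers count as wrong)."""
    correct = sum(1 for q, ans in answer_key.items()
                  if model_answers.get(q) == ans)
    return correct / len(answer_key)

# Toy example with three questions (illustrative only)
key = {"Q1": "A", "Q2": "C", "Q3": "B"}
answers = {"Q1": "A", "Q2": "C", "Q3": "D"}
print(round(accuracy(answers, key), 3))  # two of three correct -> 0.667
```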
Results: The accuracy of ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark on the CNNPQE-Junior was 49.3% (197/400), 68.5% (274/400) and 61.0% (244/400), respectively, versus 56.4% (225/399), 70.7% (282/399) and 57.6% (230/399) on the CNNPQE-Intermediate. When stratified by exam grade, the Cochran-Mantel-Haenszel (CMH) test showed a statistically significant difference in accuracy among the three models (M2 = 95.531, degrees of freedom (df) = 4, p < 0.001). ChatGPT-4's accuracy in elementary knowledge, relevant professional knowledge, professional knowledge and professional practice ability was 74.5%, 63.5%, 79.0% and 62.3%, respectively, the highest of the three models across the CNNPQE subjects. When stratified by subject, the CMH test likewise showed a statistically significant difference in accuracy among the three LLMs (M2 = 97.435, df = 4, p < 0.001).
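As a rough sanity check on the Junior-level comparison, a plain Pearson chi-square test of independence can be run on the correct/incorrect counts reported above. Note this is only an illustration: the paper itself uses the Cochran-Mantel-Haenszel test, which additionally stratifies (e.g., by exam level or subject), so the statistic below is not expected to match the reported M2 values.

```python
# Correct-answer counts from the abstract (CNNPQE-Junior, 400 questions each)
counts_correct = {"ChatGPT-3.5": 197, "ChatGPT-4": 274, "iFLYTEK Spark": 244}
n_questions = 400

# Build the 3x2 observed table: [correct, incorrect] per model
observed = [[c, n_questions - c] for c in counts_correct.values()]

grand_total = sum(sum(row) for row in observed)
col_totals = [sum(row[j] for row in observed) for j in range(2)]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = 0.0
for row in observed:
    row_total = sum(row)
    for j, obs in enumerate(row):
        expected = row_total * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (2 - 1)  # (rows - 1) * (cols - 1)
print(f"chi2 = {chi2:.2f} on {df} df")
```

The statistic comes out around 31.3 on 2 df, i.e., the accuracy differences among the three models at the Junior level are significant even under this simpler unstratified test.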
Conclusions: ChatGPT-4 and iFLYTEK Spark performed well on Chinese nursing examinations and demonstrated potential as valuable tools in nursing education.
Journal introduction:
Nursing Open is a peer-reviewed open-access journal that welcomes articles on all aspects of nursing and midwifery practice, research, education and policy. We aim to publish articles that contribute to the art and science of nursing and that have a positive impact on health, whether locally, nationally, regionally or globally.