Artificial Intelligence vs. Human Cognition: A Comparative Analysis of ChatGPT and Candidates Sitting the European Board of Ophthalmology Diploma Examination

Q2 Medicine
Anna P Maino, Jakub Klikowski, Brendan Strong, Wahid Ghaffari, Michał Woźniak, Tristan Bourcier, Andrzej Grzybowski
{"title":"人工智能与人类认知:ChatGPT和参加欧洲眼科文凭考试的考生的比较分析。","authors":"Anna P Maino, Jakub Klikowski, Brendan Strong, Wahid Ghaffari, Michał Woźniak, Tristan Bourcier, Andrzej Grzybowski","doi":"10.3390/vision9020031","DOIUrl":null,"url":null,"abstract":"<p><strong>Background/objectives: </strong>This paper aims to assess ChatGPT's performance in answering European Board of Ophthalmology Diploma (EBOD) examination papers and to compare these results to pass benchmarks and candidate results.</p><p><strong>Methods: </strong>This cross-sectional study used a sample of past exam papers from 2012, 2013, 2020-2023 EBOD examinations. This study analyzed ChatGPT's responses to 440 multiple choice questions (MCQs), each containing five true/false statements (2200 statements in total) and 48 single best answer (SBA) questions.</p><p><strong>Results: </strong>ChatGPT, for MCQs, scored on average 64.39%. ChatGPT's strongest metric performance for MCQs was precision (68.76%). ChatGPT performed best at answering pathology MCQs (Grubbs test <i>p</i> < 0.05). Optics and refraction had the lowest-scoring MCQ performance across all metrics. ChatGPT-3.5 Turbo performed worse than human candidates and ChatGPT-4o on easy questions (75% vs. 100% accuracy) but outperformed humans and ChatGPT-4o on challenging questions (50% vs. 28% accuracy). ChatGPT's SBA performance averaged 28.43%, with the highest score and strongest performance in precision (29.36%). Pathology SBA questions were consistently the lowest-scoring topic across most metrics. ChatGPT demonstrated a nonsignificant tendency to select option 1 more frequently (<i>p</i> = 0.19). When answering SBAs, human candidates scored higher than ChatGPT in all metric areas measured.</p><p><strong>Conclusions: </strong>ChatGPT performed stronger for true/false questions, scoring a pass mark in most instances. Performance was poorer for SBA questions, suggesting that ChatGPT's ability in information retrieval is better than that in knowledge integration. ChatGPT could become a valuable tool in ophthalmic education, allowing exam boards to test their exam papers to ensure they are pitched at the right level, marking open-ended questions and providing detailed feedback.</p>","PeriodicalId":36586,"journal":{"name":"Vision (Switzerland)","volume":"9 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12015923/pdf/","citationCount":"0","resultStr":"{\"title\":\"Artificial Intelligence vs. Human Cognition: A Comparative Analysis of ChatGPT and Candidates Sitting the European Board of Ophthalmology Diploma Examination.\",\"authors\":\"Anna P Maino, Jakub Klikowski, Brendan Strong, Wahid Ghaffari, Michał Woźniak, Tristan Bourcier, Andrzej Grzybowski\",\"doi\":\"10.3390/vision9020031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background/objectives: </strong>This paper aims to assess ChatGPT's performance in answering European Board of Ophthalmology Diploma (EBOD) examination papers and to compare these results to pass benchmarks and candidate results.</p><p><strong>Methods: </strong>This cross-sectional study used a sample of past exam papers from 2012, 2013, 2020-2023 EBOD examinations. 
This study analyzed ChatGPT's responses to 440 multiple choice questions (MCQs), each containing five true/false statements (2200 statements in total) and 48 single best answer (SBA) questions.</p><p><strong>Results: </strong>ChatGPT, for MCQs, scored on average 64.39%. ChatGPT's strongest metric performance for MCQs was precision (68.76%). ChatGPT performed best at answering pathology MCQs (Grubbs test <i>p</i> < 0.05). Optics and refraction had the lowest-scoring MCQ performance across all metrics. ChatGPT-3.5 Turbo performed worse than human candidates and ChatGPT-4o on easy questions (75% vs. 100% accuracy) but outperformed humans and ChatGPT-4o on challenging questions (50% vs. 28% accuracy). ChatGPT's SBA performance averaged 28.43%, with the highest score and strongest performance in precision (29.36%). Pathology SBA questions were consistently the lowest-scoring topic across most metrics. ChatGPT demonstrated a nonsignificant tendency to select option 1 more frequently (<i>p</i> = 0.19). When answering SBAs, human candidates scored higher than ChatGPT in all metric areas measured.</p><p><strong>Conclusions: </strong>ChatGPT performed stronger for true/false questions, scoring a pass mark in most instances. Performance was poorer for SBA questions, suggesting that ChatGPT's ability in information retrieval is better than that in knowledge integration. ChatGPT could become a valuable tool in ophthalmic education, allowing exam boards to test their exam papers to ensure they are pitched at the right level, marking open-ended questions and providing detailed feedback.</p>\",\"PeriodicalId\":36586,\"journal\":{\"name\":\"Vision (Switzerland)\",\"volume\":\"9 2\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-04-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12015923/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Vision (Switzerland)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/vision9020031\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Vision (Switzerland)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/vision9020031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
Citations: 0

Abstract

Background/objectives: This paper aims to assess ChatGPT's performance in answering European Board of Ophthalmology Diploma (EBOD) examination papers and to compare its results against pass benchmarks and human candidates' results.

Methods: This cross-sectional study used a sample of past exam papers from the 2012, 2013, and 2020-2023 EBOD examinations. It analyzed ChatGPT's responses to 440 multiple-choice questions (MCQs), each containing five true/false statements (2200 statements in total), and to 48 single-best-answer (SBA) questions.
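The abstract does not include the scoring code, but the evaluation it describes reduces to standard binary-classification metrics over the 2200 true/false statements. The sketch below shows one way such scoring could be computed; the file name and column names (ebod_mcq_statements.csv, key, chatgpt_answer, topic) are assumptions for illustration, not taken from the study.

```python
# Illustrative sketch only; not the authors' code. Assumes a hypothetical CSV
# with one row per true/false statement: the official key, ChatGPT's answer,
# and the exam topic (e.g. pathology, optics and refraction).
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("ebod_mcq_statements.csv")  # hypothetical file name

# Treat "true" as the positive class, so precision is the share of
# statements ChatGPT marked true that were actually true.
y_true = df["key"].str.lower().eq("true")
y_pred = df["chatgpt_answer"].str.lower().eq("true")

print(f"accuracy : {accuracy_score(y_true, y_pred):.2%}")
print(f"precision: {precision_score(y_true, y_pred):.2%}")
print(f"recall   : {recall_score(y_true, y_pred):.2%}")
print(f"F1       : {f1_score(y_true, y_pred):.2%}")

# Per-topic accuracy, mirroring the topic-level comparison in the Results.
for topic, grp in df.groupby("topic"):
    topic_true = grp["key"].str.lower().eq("true")
    topic_pred = grp["chatgpt_answer"].str.lower().eq("true")
    print(f"{topic}: {topic_true.eq(topic_pred).mean():.2%}")
```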

Results: For MCQs, ChatGPT scored 64.39% on average. Its strongest MCQ metric was precision (68.76%). ChatGPT performed best at answering pathology MCQs (Grubbs test p < 0.05). Optics and refraction was the lowest-scoring MCQ topic across all metrics. ChatGPT-3.5 Turbo performed worse than human candidates and ChatGPT-4o on easy questions (75% vs. 100% accuracy) but outperformed humans and ChatGPT-4o on challenging questions (50% vs. 28% accuracy). For SBAs, ChatGPT averaged 28.43%, with precision again its strongest metric (29.36%). Pathology SBA questions were consistently the lowest-scoring topic across most metrics. ChatGPT showed a nonsignificant tendency to select option 1 more frequently (p = 0.19). When answering SBAs, human candidates scored higher than ChatGPT in all metrics measured.
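The topic-level claim above relies on a Grubbs test to flag pathology as an outlying MCQ score. As a reference for that statistic, here is a minimal two-sided Grubbs test; the per-topic scores in the example are invented for illustration and are not the study's data.

```python
# Minimal two-sided Grubbs outlier test (the statistic cited for the
# pathology MCQ result). The example scores are hypothetical.
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Return (G, G_critical, is_outlier) for the most extreme value."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the t distribution, adjusted over the n candidate points.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, g > g_crit

# Hypothetical per-topic MCQ accuracies (%); one topic scores well above the rest.
topic_scores = [64.0, 62.5, 66.1, 63.8, 78.9, 61.2]
print(grubbs_test(topic_scores))
```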

Conclusions: ChatGPT performed more strongly on true/false questions, scoring a pass mark in most instances. Performance was poorer on SBA questions, suggesting that ChatGPT's ability in information retrieval is better than its ability in knowledge integration. ChatGPT could become a valuable tool in ophthalmic education, allowing exam boards to test their exam papers to ensure they are pitched at the right level, to mark open-ended questions, and to provide detailed feedback.

Source journal
Vision (Switzerland) (Health Professions: Optometry)
CiteScore: 2.30
Self-citation rate: 0.00%
Articles published: 62
Review time: 11 weeks