Comparison of generative AI performance on undergraduate and postgraduate written assessments in the biomedical sciences

Impact Factor: 8.6 · JCR Q1, Education & Educational Research
Andrew Williams
Journal: International Journal of Educational Technology in Higher Education
DOI: 10.1186/s41239-024-00485-y
Published: 2024-09-13 (Journal Article)
Cited by: 0

Abstract

The value of generative AI tools in higher education has received considerable attention. Although there are many proponents of their value as learning tools, many are concerned about academic integrity and their use by students to compose written assessments. This study evaluates and compares the output of three commonly used generative AI tools: ChatGPT, Bing and Bard. Each AI tool was prompted with an essay question from undergraduate (UG) level 4 (year 1), level 5 (year 2), level 6 (year 3) and postgraduate (PG) level 7 biomedical sciences courses. Anonymised AI-generated output was then evaluated by four independent markers according to specified marking criteria and matched to the Frameworks for Higher Education Qualifications (FHEQ) of UK level descriptors. Percentage scores and ordinal grades were given for each marking criterion across AI-generated papers; inter-rater reliability was calculated using Kendall’s coefficient of concordance, and generative AI performance was ranked. Across all UG and PG levels, ChatGPT performed better than Bing or Bard in areas of scientific accuracy, scientific detail and context. All AI tools performed consistently well at PG level compared to UG level, although only ChatGPT consistently met levels of high attainment at all UG levels. ChatGPT and Bing did not provide adequate references, while Bing falsified references. In conclusion, generative AI tools are useful for providing scientific information consistent with the academic standards required of students in written assignments. These findings have broad implications for the design, implementation and grading of written assessments in higher education.
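As an illustrative sketch (not from the paper), inter-rater reliability via Kendall’s coefficient of concordance (W), as used by the four markers in this study, can be computed as follows; the marker scores below are invented for demonstration:

```python
def kendalls_w(ratings):
    """ratings: list of rater score-lists; each inner list scores the
    same n items. Returns W in [0, 1] (1 = perfect agreement).
    This sketch omits the correction for tied ranks."""
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of items rated

    # Convert each rater's raw scores to ranks (1 = lowest score).
    def to_ranks(scores):
        order = sorted(range(n), key=lambda i: scores[i])
        ranks = [0.0] * n
        for rank, idx in enumerate(order, start=1):
            ranks[idx] = float(rank)
        return ranks

    rank_matrix = [to_ranks(r) for r in ratings]

    # Sum of ranks per item, then squared deviations from the mean sum.
    rank_sums = [sum(r[i] for r in rank_matrix) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)

    # Kendall's W = 12S / (m^2 (n^3 - n))
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Four hypothetical markers each scoring three AI-generated essays:
scores = [
    [72, 55, 48],
    [70, 58, 45],
    [75, 52, 50],
    [68, 60, 44],
]
print(round(kendalls_w(scores), 3))  # → 1.0 (all markers rank the essays identically)
```

Here every marker orders the three essays the same way, so W = 1; complete disagreement among rankings drives W toward 0.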


Source journal metrics:
CiteScore: 19.30
Self-citation rate: 4.70%
Articles published: 59
Review time: 76.7 days
Journal description: This journal seeks to foster the sharing of critical scholarly works and information exchange across diverse cultural perspectives in the fields of technology-enhanced and digital learning in higher education. It aims to advance scientific knowledge on the human and personal aspects of technology use in higher education, while keeping readers informed about the latest developments in applying digital technologies to learning, training, research, and management.