ChatGPT as an item calibration tool: Psychometric insights in a high-stakes examination.
Daniela S M Pereira, Francisco Mourão, João Carlos Ribeiro, Patrício Costa, Serafim Guimarães, José Miguel Pêgo
Medical Teacher, 2025, pp. 677-683. https://doi.org/10.1080/0142159X.2024.2376205
Abstract
Introduction: ChatGPT has attracted considerable interest worldwide for its versatility across a range of natural language tasks, including in the education and evaluation industry. It can automate time- and labor-intensive tasks with clear economic and efficiency gains.
Methods: This study evaluated the potential of ChatGPT to automate psychometric analysis of test questions from the 2020 Portuguese National Residency Selection Exam (PNA). ChatGPT was queried 100 times with the 150 MCQs from the exam. Using ChatGPT's responses, a difficulty index was calculated for each question based on the proportion of correct answers. The predicted difficulty levels were compared to the actual difficulty levels of the 2020 exam MCQs using methods from classical test theory.
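The classical-test-theory difficulty index used here is simply the proportion of correct answers per item across the repeated queries. The snippet below is a minimal sketch of that calculation, not the authors' code: the 100 × 150 correctness matrix is simulated and the function name difficulty_indices is an illustrative assumption.

```python
import numpy as np

def difficulty_indices(responses: np.ndarray) -> np.ndarray:
    """Classical-test-theory difficulty (P) index: proportion correct per item."""
    return responses.mean(axis=0)

# Simulated stand-in for the real data: 100 ChatGPT runs x 150 MCQs,
# True where the model answered that item correctly on that run.
rng = np.random.default_rng(0)
responses = rng.random((100, 150)) < 0.7

predicted_p = difficulty_indices(responses)
print(predicted_p[:5])  # difficulty indices of the first five items
```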
Results: ChatGPT's predicted item difficulty indices positively correlated with the actual item difficulties (r(148) = -0.372, p < .001), suggesting a general consistency between the real and the predicted values. There was also a moderate significant negative correlation between the difficulty index predicted by ChatGPT and the number of challenges (r(148) = -0.302, p < .001), highlighting ChatGPT's potential for identifying less problematic questions.
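For context, the statistic reported above is a Pearson correlation with n - 2 = 148 degrees of freedom for the 150 items. The snippet below is a hedged illustration of that comparison using simulated placeholder indices, not the study data.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder difficulty indices for 150 items:
# ChatGPT-derived vs. examinee-derived (both simulated for illustration).
rng = np.random.default_rng(1)
predicted_p = rng.uniform(0.2, 0.95, size=150)
actual_p = np.clip(predicted_p + rng.normal(0.0, 0.15, size=150), 0.0, 1.0)

# Pearson correlation; with 150 items the degrees of freedom are n - 2 = 148.
r, p_value = pearsonr(predicted_p, actual_p)
print(f"r(148) = {r:.3f}, p = {p_value:.3g}")
```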
Conclusion: These findings highlight ChatGPT's potential as a tool for assessment development, demonstrating its capability to predict the psychometric characteristics of high-stakes test items and enabling automated item calibration without pre-testing in real-life scenarios.
Journal description:
Medical Teacher provides accounts of new teaching methods, guidance on structuring courses and assessing achievement, and serves as a forum for communication between medical teachers and those involved in general education. In particular, the journal recognizes the problems teachers have in keeping up-to-date with the developments in educational methods that lead to more effective teaching and learning at a time when the content of the curriculum—from medical procedures to policy changes in health care provision—is also changing. The journal features reports of innovation and research in medical education, case studies, survey articles, practical guidelines, reviews of current literature and book reviews. All articles are peer reviewed.