{"title":"Predictive Fit Metrics for Item Response Models.","authors":"Benjamin A Stenhaug, Benjamin W Domingue","doi":"10.1177/01466216211066603","DOIUrl":null,"url":null,"abstract":"<p><p>The fit of an item response model is typically conceptualized as whether a given model could have generated the data. In this study, for an alternative view of fit, \"predictive fit,\" based on the model's ability to predict new data is advocated. The authors define two prediction tasks: \"missing responses prediction\"-where the goal is to predict an in-sample person's response to an in-sample item-and \"missing persons prediction\"-where the goal is to predict an out-of-sample person's string of responses. Based on these prediction tasks, two predictive fit metrics are derived for item response models that assess how well an estimated item response model fits the data-generating model. These metrics are based on long-run out-of-sample predictive performance (i.e., if the data-generating model produced infinite amounts of data, what is the quality of a \"model's predictions on average?\"). Simulation studies are conducted to identify the prediction-maximizing model across a variety of conditions. For example, defining prediction in terms of missing responses, greater average person ability, and greater item discrimination are all associated with the 3PL model producing relatively worse predictions, and thus lead to greater minimum sample sizes for the 3PL model. In each simulation, the prediction-maximizing model to the model selected by Akaike's information criterion, Bayesian information criterion (BIC), and likelihood ratio tests are compared. It is found that performance of these methods depends on the prediction task of interest. In general, likelihood ratio tests often select overly flexible models, while BIC selects overly parsimonious models. The authors use Programme for International Student Assessment data to demonstrate how to use cross-validation to directly estimate the predictive fit metrics in practice. The implications for item response model selection in operational settings are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 2","pages":"136-155"},"PeriodicalIF":1.0000,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8908407/pdf/10.1177_01466216211066603.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Psychological Measurement","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1177/01466216211066603","RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2022/2/13 0:00:00","PubModel":"Epub","JCR":"Q4","JCRName":"PSYCHOLOGY, MATHEMATICAL","Score":null,"Total":0}
Citations: 0
Abstract
The fit of an item response model is typically conceptualized as whether a given model could have generated the data. In this study, an alternative view of fit, "predictive fit," based on the model's ability to predict new data, is advocated. The authors define two prediction tasks: "missing responses prediction," where the goal is to predict an in-sample person's response to an in-sample item, and "missing persons prediction," where the goal is to predict an out-of-sample person's string of responses. Based on these prediction tasks, two predictive fit metrics are derived for item response models that assess how well an estimated item response model fits the data-generating model. These metrics are based on long-run out-of-sample predictive performance (i.e., if the data-generating model produced infinite amounts of data, what is the quality of a model's predictions on average?). Simulation studies are conducted to identify the prediction-maximizing model across a variety of conditions. For example, defining prediction in terms of missing responses, greater average person ability, and greater item discrimination are all associated with the 3PL model producing relatively worse predictions and thus lead to greater minimum sample sizes for the 3PL model. In each simulation, the prediction-maximizing model is compared to the models selected by Akaike's information criterion (AIC), the Bayesian information criterion (BIC), and likelihood ratio tests. It is found that the performance of these methods depends on the prediction task of interest. In general, likelihood ratio tests often select overly flexible models, while BIC selects overly parsimonious models. The authors use Programme for International Student Assessment data to demonstrate how cross-validation can be used to directly estimate the predictive fit metrics in practice. The implications for item response model selection in operational settings are discussed.
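To make the "missing responses prediction" task and its cross-validation estimate concrete, the following is a minimal sketch: a random subset of cells in the person-by-item response matrix is held out, a simple item response model is fit to the remaining cells, and the held-out cells are scored by their log predictive density. The Rasch model, joint maximum-likelihood estimation, the 20% holdout fraction, and all variable names here are illustrative assumptions, not the paper's exact procedure or models.

```python
# Sketch of "missing responses" predictive fit via cell-level cross-validation.
# Assumptions (not from the paper): Rasch model, joint maximum likelihood,
# a single 20% random holdout of response-matrix cells.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)

# Simulate a small response matrix from a Rasch data-generating model.
n_persons, n_items = 200, 20
theta_true = rng.normal(0, 1, n_persons)            # person abilities
b_true = rng.normal(0, 1, n_items)                  # item difficulties
p_true = expit(theta_true[:, None] - b_true[None, :])
responses = rng.binomial(1, p_true)                 # persons x items, 0/1

# Randomly hold out 20% of cells: these are the "missing responses" to predict.
holdout = rng.random((n_persons, n_items)) < 0.20
train = ~holdout

def neg_log_lik(params):
    """Joint negative Bernoulli log-likelihood on the training cells only."""
    theta, b = params[:n_persons], params[n_persons:]
    logits = theta[:, None] - b[None, :]
    ll = responses * logits - np.logaddexp(0.0, logits)
    return -ll[train].sum()

def neg_log_lik_grad(params):
    """Analytic gradient of the negative log-likelihood (persons, then items)."""
    theta, b = params[:n_persons], params[n_persons:]
    p = expit(theta[:, None] - b[None, :])
    resid = np.where(train, responses - p, 0.0)
    return np.concatenate([-resid.sum(axis=1), resid.sum(axis=0)])

# Fit person and item parameters by joint maximum likelihood (illustrative only;
# marginal ML or Bayesian estimation would typically be preferred in practice).
init = np.zeros(n_persons + n_items)
fit = minimize(neg_log_lik, init, jac=neg_log_lik_grad, method="L-BFGS-B")
theta_hat, b_hat = fit.x[:n_persons], fit.x[n_persons:]

# Predictive fit: mean log predictive density of the held-out responses.
logits_hat = theta_hat[:, None] - b_hat[None, :]
log_p = responses * logits_hat - np.logaddexp(0.0, logits_hat)
print("held-out mean log predictive density:", log_p[holdout].mean())
```

Comparing this held-out log predictive density across candidate item response models (e.g., 2PL versus 3PL) would mimic selecting the prediction-maximizing model in the sense described in the abstract.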
Journal description:
Applied Psychological Measurement publishes empirical research on the application of techniques of psychological measurement to substantive problems in all areas of psychology and related disciplines.