{"title":"Gender Bias in Test Item Formats: Evidence from PISA 2009, 2012, and 2015 Math and Reading Tests","authors":"Benjamin R. Shear","doi":"10.1111/jedm.12372","DOIUrl":"10.1111/jedm.12372","url":null,"abstract":"<p>Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item-format-by-gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and was larger on average in reading than in math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias, but is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"676-696"},"PeriodicalIF":1.3,"publicationDate":"2023-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42035945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Differential Item Functioning in CAT Using IRT Residual DIF Approach","authors":"Hwanggyu Lim, Edison M. Choe","doi":"10.1111/jedm.12366","DOIUrl":"10.1111/jedm.12366","url":null,"abstract":"<p>The residual differential item functioning (RDIF) detection framework was developed recently under a linear testing context. To explore the potential application of this framework to computerized adaptive testing (CAT), the present study investigated the utility of the RDIF<sub>R</sub> statistic both as an index for detecting uniform DIF of pretest items in CAT and as a direct measure of the effect size of uniform DIF. Extensive CAT simulations revealed RDIF<sub>R</sub> to have well-controlled Type I error and slightly higher power to detect uniform DIF compared with CATSIB, especially when pretest items were calibrated using fixed-item parameter calibration. Moreover, RDIF<sub>R</sub> accurately estimated the amount of uniform DIF irrespective of the presence of impact. Therefore, RDIF<sub>R</sub> demonstrates its potential as a useful tool for evaluating both the statistical and practical significance of uniform DIF in CAT.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"626-650"},"PeriodicalIF":1.3,"publicationDate":"2023-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45693936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling the Speededness of Assembled Test Forms: A Generalization to the Three-Parameter Lognormal Response Time Model","authors":"Benjamin Becker, Sebastian Weirich, Frank Goldhammer, Dries Debeer","doi":"10.1111/jedm.12364","DOIUrl":"10.1111/jedm.12364","url":null,"abstract":"<p>When designing or modifying a test, an important challenge is controlling its speededness. To achieve this, van der Linden (2011a, 2011b) proposed using a lognormal response time model, more specifically the two-parameter lognormal model, and automated test assembly (ATA) via mixed integer linear programming. However, this approach has a severe limitation, in that the two-parameter lognormal model lacks a slope parameter. This means the model assumes that all items are equally speed sensitive. From a conceptual perspective, this assumption seems very restrictive. Furthermore, various empirical studies, as well as new data analyses we performed, show that this assumption almost never holds in practice. To overcome this shortcoming, we bring together the already frequently used three-parameter lognormal model for response times, which contains a slope parameter, and van der Linden's ATA approach for controlling speededness. The proposed extension is demonstrated through multiple empirically based illustrations, including complete and documented R code. Both the original van der Linden approach and our newly proposed approach are available to practitioners in the freely available R package <span>eatATA</span>.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"551-574"},"PeriodicalIF":1.3,"publicationDate":"2023-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12364","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49199830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Note on Latent Traits Estimates under IRT Models with Missingness","authors":"Jinxin Guo, Xin Xu, Tao Xin","doi":"10.1111/jedm.12365","DOIUrl":"10.1111/jedm.12365","url":null,"abstract":"<p>Missingness due to not-reached items and omitted items has received much attention in the recent psychometric literature. Such missingness, if not handled properly, can lead to biased parameter estimation and inaccurate inferences about examinees, further eroding the validity of the test. This paper reviews some commonly used IRT-based models that allow for missingness, followed by three popular examinee scoring methods: maximum likelihood estimation, maximum a posteriori, and expected a posteriori. Simulation studies were conducted to compare these examinee scoring methods across these commonly used models in the presence of missingness. Results showed that all the methods could infer examinees' ability accurately when the missingness is ignorable. If the missingness is nonignorable, incorporating the missing responses improves the precision of ability estimates for examinees with missingness, especially when the test length is short. Among the examinee scoring methods, the expected a posteriori method performed best for evaluating latent traits under models allowing missingness. An empirical study based on the PISA 2015 Science Test is also presented.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"575-625"},"PeriodicalIF":1.3,"publicationDate":"2023-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44924100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Monitoring of Test-Taking Behavior Based on Item Responses and Response Times","authors":"Suhwa Han, Hyeon-Ah Kang","doi":"10.1111/jedm.12367","DOIUrl":"10.1111/jedm.12367","url":null,"abstract":"<p>The study presents multivariate sequential monitoring procedures for examining test-taking behaviors online. The procedures monitor examinees' responses and response times and signal aberrancy as soon as a significant change is detected in the test-taking behavior. The study proposes three schemes to track different indicators of a test-taking mode—the observable manifest variables, the latent trait variables, and the measurement likelihood. For each procedure, sequential sampling strategies are presented to implement online monitoring. Numerical experimentation based on simulated data suggests that the proposed procedures perform adequately: they identified examinees with aberrant behaviors with high detection power and timeliness while keeping error rates reasonably small. Experimental application to real data also suggested that the procedures have practical relevance to real assessments. Based on the observations from the experimental analysis, the study discusses implications and guidelines for practical use.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 4","pages":"651-675"},"PeriodicalIF":1.3,"publicationDate":"2023-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46552013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Group Collaboration Using Multiple Correspondence Analysis","authors":"Joseph H. Grochowalski, Amy Hendrickson","doi":"10.1111/jedm.12363","DOIUrl":"10.1111/jedm.12363","url":null,"abstract":"<p>Test takers wishing to gain an unfair advantage often share answers with other test takers, either sharing all answers (a full key) or some (a partial key). Detecting key sharing during a tight testing window requires an efficient, easily interpretable, and rich form of analysis that is both descriptive and inferential. We introduce a detection method based on multiple correspondence analysis (MCA) that identifies test takers with unusual response similarities. The method simultaneously detects multiple shared keys (partial or full), plots results, and is computationally efficient as it requires only matrix operations. We describe the method, evaluate its detection accuracy under various simulation conditions, and demonstrate the procedure on a real data set with known test-taking misbehavior. The simulation results showed that the MCA method had reasonably high power under realistic conditions and maintained the nominal false-positive level, except when the group size was very large or partial shared keys contained more than 50% of the items. The real data analysis illustrated visual detection procedures and inference about the item responses possibly shared in the key, which was likely shared among 91 test takers, many of whom were confirmed by nonstatistical investigation to have engaged in test-taking misconduct.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"402-427"},"PeriodicalIF":1.3,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44728675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pretest Item Calibration in Computerized Multistage Adaptive Testing","authors":"Rabia Karatoprak Ersen, Won-Chan Lee","doi":"10.1111/jedm.12361","DOIUrl":"10.1111/jedm.12361","url":null,"abstract":"<p>The purpose of this study was to compare calibration and linking methods for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design in terms of item parameter recovery. Two models were used: embedded-section, in which pretest items were administered within a separate module, and embedded-items, in which pretest items were distributed across operational modules. The calibration methods were separate calibration with linking (SC) and fixed calibration (FC), with three parallel approaches under each (FC-1 and SC-1; FC-2 and SC-2; FC-3 and SC-3). FC-1 and SC-1 used only operational items in the routing module to link pretest items. FC-2 and SC-2 also used only operational items in the routing module for linking, but in addition, the operational items in second-stage modules were freely estimated. FC-3 and SC-3 used operational items in all modules to link pretest items. The third calibration approach (i.e., FC-3 and SC-3) yielded the best results. For all three approaches, SC outperformed FC across all study conditions, which varied module length, sample size, and examinee distribution.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"379-401"},"PeriodicalIF":1.3,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12361","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48133014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classical Item Analysis from a Signal Detection Perspective","authors":"Lawrence T. DeCarlo","doi":"10.1111/jedm.12358","DOIUrl":"10.1111/jedm.12358","url":null,"abstract":"<p>A conceptualization of multiple-choice exams in terms of signal detection theory (SDT) leads to simple measures of item difficulty and item discrimination that are closely related to, but also distinct from, those used in classical item analysis (CIA). The theory defines a “true split,” depending on whether or not examinees know an item, and so it provides a basis for using total scores to split item tables, as done in CIA, while also clarifying benefits and limitations of the approach. The SDT item difficulty and discrimination measures differ from those used in CIA in that they explicitly consider the role of distractors and avoid limitations due to range restrictions. A new screening measure is also introduced. The measures are theoretically well-grounded and are simple to compute by hand calculations or with standard software for choice models; simulations show that they offer advantages over traditional measures.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"520-547"},"PeriodicalIF":1.3,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42654295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corrigendum: A Residual-Based Differential Item Functioning Detection Framework in Item Response Theory","authors":"Hwanggyu Lim, Edison M. Choe, Kyung T. Han","doi":"10.1111/jedm.12362","DOIUrl":"10.1111/jedm.12362","url":null,"abstract":"<p>In the original article, it was written that “Then the MLE scoring and DIF analysis with RDIF statistics were performed using the <i>est_score</i> and <i>rdif</i> functions, respectively, in the R (R Core Team, 2019) package irtplay (p.90).” However, the irtplay package has been removed from the CRAN repository due to intellectual property (IP) violation issues. Instead, a new R package called irtQ (Lim & Wells, <span>2023</span>) has been released as a successor to irtplay. All IP issues have been resolved in irtQ, ensuring that the package is compliant with industry standards. https://doi.org/10.1111/jedm.12313</p><p>We note that the same <i>est_score</i> and <i>rdif</i> functions used in the original study are also included in irtQ, so it can be used as a replacement for irtplay. We apologize for any confusion caused by the previous version of the article.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 1","pages":"170"},"PeriodicalIF":1.3,"publicationDate":"2023-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12362","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44304771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation","authors":"Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi","doi":"10.1111/jedm.12360","DOIUrl":"10.1111/jedm.12360","url":null,"abstract":"<p>Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. To fit such models, the ratings data must be sufficiently connected to estimate rater effects. Due to the rating designs popular in large-scale testing scenarios, there tends to be a large proportion of missing data, yielding sparse matrices and estimation issues. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set—a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different connectedness yielded by different rating designs, and the role of scores from automated scoring engines. In designing monitoring systems using the rater response version of the generalized partial credit model, the study results suggest use of a linkage set, especially a large one comprising responses that represent the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design with one human rater and an automated scoring engine, and that scores from automated scoring engines alone do not provide adequate connectedness. We discuss considerations for operational implementation and further study.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"428-454"},"PeriodicalIF":1.3,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47979944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}