{"title":"Item Selection With Collaborative Filtering in On-The-Fly Multistage Adaptive Testing.","authors":"Jiaying Xiao, Okan Bulut","doi":"10.1177/01466216221124089","DOIUrl":"https://doi.org/10.1177/01466216221124089","url":null,"abstract":"<p><p>An important design feature in the implementation of both computerized adaptive testing and multistage adaptive testing is the use of an appropriate method for item selection. The item selection method is expected to select optimal items based on the examinees' ability level while considering other design features (e.g., item exposure and item bank utilization). This study introduced collaborative filtering (CF) as a new method for item selection in the <i>on-the-fly assembled multistage adaptive testing</i> framework. The user-based CF (UBCF) and item-based CF (IBCF) methods were compared to the maximum Fisher information method based on the accuracy of ability estimation, item exposure rates, and item bank utilization under different test conditions (e.g., item bank size, test length, and the sparseness of training data). The simulation results indicated that the UBCF method outperformed the traditional item selection methods regarding measurement accuracy. Also, the IBCF method showed the best performance in terms of item bank utilization. Limitations of the current study and directions for future research are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 8","pages":"690-704"},"PeriodicalIF":1.2,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/09/ba/10.1177_01466216221124089.PMC9574085.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40656645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible Item Response Models for Count Data: The Count Thresholds Model.","authors":"Gerhard Tutz","doi":"10.1177/01466216221108124","DOIUrl":"10.1177/01466216221108124","url":null,"abstract":"<p><p>A new item response theory model for count data is introduced. In contrast to models in common use, it does not assume a fixed distribution for the responses as, for example, the Poisson count model and its extensions do. The distribution of responses is determined by difficulty functions, which reflect the characteristics of items in a flexible way. Sparse parameterizations are obtained by choosing fixed parametric difficulty functions; more general versions use an approximation by basis functions. The model can be seen as constructed from binary response models such as the Rasch model or the normal-ogive model, to which it reduces if responses are dichotomized. It is demonstrated that the model competes well with advanced count data models. Simulations demonstrate that parameters and response distributions are recovered well. An application shows the flexibility of the model to account for strongly varying distributions of responses.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 8","pages":"643-661"},"PeriodicalIF":1.2,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9574081/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40573824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Empirical Identification Issue of the Bifactor Item Response Theory Model.","authors":"Wenya Chen, Ken A Fujimoto","doi":"10.1177/01466216221108133","DOIUrl":"10.1177/01466216221108133","url":null,"abstract":"<p><p>Using the bifactor item response theory model to analyze data arising from educational and psychological studies has gained popularity over the years. Unfortunately, using this model in practice comes with challenges. One such challenge is an empirical identification issue that is seldom discussed in the literature, and its impact on the estimates of the bifactor model's parameters has not been demonstrated. This issue occurs when an item's discriminations on the general and specific dimensions are approximately equal (i.e., the within-item discriminations are similar in strength), leading to difficulties in obtaining unique estimates for those discriminations. We conducted three simulation studies to demonstrate that within-item discriminations being similar in strength creates problems in estimation stability. The results suggest that a large sample could alleviate but not resolve the problems, at least when considering sample sizes up to 4,000. When the discriminations within items were made clearly different, the estimates of these discriminations were more consistent across the data replicates than that observed when the discriminations within the items were similar. The results also show that the similarity of an item's discriminatory magnitudes on different dimensions has direct implications for the sample size needed to consistently obtain accurate parameter estimates. Although our goal was to provide evidence of the empirical identification issue, the study further reveals that the extent of similarity of within-item discriminations, the magnitude of discriminations, and how well the items are targeted to the respondents also play a role in the estimation of the bifactor model's parameters.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 8","pages":"675-689"},"PeriodicalIF":1.2,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9574084/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40656647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modified Item-Fit Indices for Dichotomous IRT Models with Missing Data.","authors":"Xue Zhang, Chun Wang","doi":"10.1177/01466216221125176","DOIUrl":"10.1177/01466216221125176","url":null,"abstract":"<p><p>Item-level fit analysis not only serves as a complementary check to global fit analysis, it is also essential in scale development because the fit results will guide item revision and/or deletion (Liu & Maydeu-Olivares, 2014). During data collection, missing response data may occur for various reasons. Chi-square-based item fit indices (e.g., Yen's <i>Q</i> <sub><i>1</i></sub> , McKinley and Mill's <i>G</i> <sup><i>2</i></sup> , Orlando and Thissen's <i>S-X</i> <sup><i>2</i></sup> and <i>S-G</i> <sup><i>2</i></sup> ) are the most widely used statistics to assess item-level fit. However, the role of total scores with complete data used in <i>S-X</i> <sup><i>2</i></sup> and <i>S-G</i> <sup><i>2</i></sup> is different from that with incomplete data. As a result, <i>S-X</i> <sup><i>2</i></sup> and <i>S-G</i> <sup><i>2</i></sup> cannot handle incomplete data directly. To this end, we propose several modified versions of <i>S-X</i> <sup><i>2</i></sup> and <i>S-G</i> <sup><i>2</i></sup> to evaluate item-level fit when response data are incomplete, named <i>M</i> <sub><i>impute</i></sub> <i>-X</i> <sup><i>2</i></sup> and <i>M</i> <sub><i>impute</i></sub> <i>-G</i> <sup><i>2</i></sup> , of which the subscript \"<i>impute</i>\" denotes different imputation methods. Instead of using observed total scores for grouping, the new indices rely on imputed total scores by either a single imputation method or three multiple imputation methods (i.e., two-way with normally distributed errors, corrected item-mean substitution with normally distributed errors, and response function imputation). The new indices are equivalent to <i>S-X</i> <sup><i>2</i></sup> and <i>S-G</i> <sup><i>2</i></sup> when response data are complete. Their performances are evaluated and compared via simulation studies; the manipulated factors include test length, sources of misfit, misfit proportion, and missing proportion. The results from simulation studies are consistent with those of Orlando and Thissen (2000, 2003), and different indices are recommended under different conditions.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 8","pages":"705-719"},"PeriodicalIF":1.2,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9574083/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40656646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diagnostic Classification Models for a Mixture of Ordered and Non-ordered Response Options in Rating Scales.","authors":"Ren Liu, Haiyan Liu, Dexin Shi, Zhehan Jiang","doi":"10.1177/01466216221108132","DOIUrl":"10.1177/01466216221108132","url":null,"abstract":"<p><p>When developing ordinal rating scales, we may include potentially unordered response options such as \"Neither Agree nor Disagree,\" \"Neutral,\" \"Don't Know,\" \"No Opinion,\" or \"Hard to Say.\" To handle responses to a mixture of ordered and unordered options, Huggins-Manley et al. (2018) proposed a class of semi-ordered models under the unidimensional item response theory framework. This study extends the concept of semi-ordered models into the area of diagnostic classification models. Specifically, we propose a flexible framework of semi-ordered DCMs that accommodates most earlier DCMs and allows for analyzing the relationship between those potentially unordered responses and the measured traits. Results from an operational study and two simulation studies show that the proposed framework can incorporate both ordered and non-ordered responses into the estimation of the latent traits and thus provide useful information about both the items and the respondents.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 7","pages":"622-639"},"PeriodicalIF":1.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/f0/84/10.1177_01466216221108132.PMC9483220.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33466446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Optimal Design of Bifactor Multidimensional Computerized Adaptive Testing with Mixed-format Items.","authors":"Xiuzhen Mao, Jiahui Zhang, Tao Xin","doi":"10.1177/01466216221108382","DOIUrl":"10.1177/01466216221108382","url":null,"abstract":"<p><p>Multidimensional computerized adaptive testing (MCAT) using mixed-format items holds great potential for next-generation assessments. Two critical factors in the mixed-format test design (i.e., the order and proportion of polytomous items) and item selection were addressed in the context of mixed-format bifactor MCAT. For item selection, this article presents the derivation of the Fisher information matrix of the bifactor graded response model and the application of the bifactor dimension reduction method to simplify the computation of the mutual information (MI) item selection method. In a simulation study, different MCAT designs were compared with varying proportions of polytomous items (0.2-0.6, 1), different item-delivering formats (DPmix: delivering polytomous items at the final stage; RPmix: random delivering), three bifactor patterns (low, middle, and high), and two item selection methods (Bayesian D-optimality and MI). Simulation results suggested that a) the overall estimation precision increased with a higher bifactor pattern; b) the two item selection methods did not show substantial differences in estimation precision; and c) the RPmix format always led to more precise interim and final estimates than the DPmix format. The proportions of 0.3 and 0.4 were recommended for the RPmix and DPmix formats, respectively.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 7","pages":"605-621"},"PeriodicalIF":1.2,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9483217/pdf/10.1177_01466216221108382.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33466926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncovering the Complexity of Item Position Effects in a Low-Stakes Testing Context.","authors":"Thai Q Ong, Dena A Pastor","doi":"10.1177/01466216221108134","DOIUrl":"10.1177/01466216221108134","url":null,"abstract":"<p><p>Previous researchers have adopted either an item or an examinee perspective on position effects, focusing on exploring the relationships among position effects and item or examinee variables separately. Unlike previous researchers, we adopted an integrated perspective on position effects, exploring the relationships among position effects, item variables, and examinee variables simultaneously. We evaluated the degree to which position effects on two separate low-stakes tests administered to two different samples were moderated by different item (item length, number of response options, mental taxation, and graphic) and examinee (effort, change in effort, and gender) variables. Items exhibited significant negative linear position effects on both tests, with the magnitude of the position effects varying from item to item. Longer items were more prone to position effects than shorter items; however, the level of mental taxation required to answer the item, the presence of a graphic, and the number of response options were not related to position effects. Examinee effort levels, change in effort patterns, and gender did not moderate the relationships among position effects and item features.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 7","pages":"571-588"},"PeriodicalIF":1.2,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9483218/pdf/10.1177_01466216221108134.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33466447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Termination Criteria for Grid Multiclassification Adaptive Testing With Multidimensional Polytomous Items.","authors":"Zhuoran Wang, Chun Wang, David J Weiss","doi":"10.1177/01466216221108383","DOIUrl":"10.1177/01466216221108383","url":null,"abstract":"<p><p>Adaptive classification testing (ACT) is a variation of computerized adaptive testing (CAT) that is developed to efficiently classify examinees into multiple groups based on predetermined cutoffs. In multidimensional multiclassification (i.e., more than two categories exist along each dimension), grid classification is proposed to classify each examinee into one of the grids encircled by cutoffs (lines/surfaces) along different dimensions so as to provide clearer information regarding an examinee's relative standing along each dimension and facilitate subsequent treatment and intervention. In this article, the sequential probability ratio test (SPRT) and confidence interval method were implemented in the grid multiclassification ACT. In addition, two new termination criteria, the grid classification generalized likelihood ratio (GGLR) and the simplified grid classification generalized likelihood ratio, were proposed for grid multiclassification ACT. Simulation studies, using a simulated item bank and a real item bank with polytomous multidimensional items, show that grid multiclassification ACT is more efficient than classification based on measurement CAT that focuses on trait estimate precision. In the context of a high-quality bank, GGLR was found to most efficiently terminate the grid multiclassification ACT and classify examinees.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 7","pages":"551-570"},"PeriodicalIF":1.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9483219/pdf/10.1177_01466216221108383.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33466449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Effect of Differential Rapid Guessing on Population Invariance in Equating.","authors":"Jiayi Deng, Joseph A Rios","doi":"10.1177/01466216221108991","DOIUrl":"10.1177/01466216221108991","url":null,"abstract":"<p><p>Score equating is an essential tool in improving the fairness of test score interpretations when employing multiple test forms. To ensure that the equating functions used to connect scores from one form to another are valid, they must be invariant across different populations of examinees. Given that equating is used in many low-stakes testing programs, examinees' test-taking effort should be considered carefully when evaluating population invariance in equating, particularly as the occurrence of rapid guessing (RG) has been found to differ across subgroups. To this end, the current study investigated whether differential RG rates between subgroups can lead to incorrect inferences concerning population invariance in test equating. A simulation was built to generate data for two examinee subgroups (one more motivated than the other) administered two alternative forms of multiple-choice items. The rate of RG and ability characteristics of rapid guessers were manipulated. Results showed that as RG responses increased, false positive and false negative inferences of equating invariance were respectively observed at the lower and upper ends of the observed score scale. This result was supported by an empirical analysis of an international assessment. These findings suggest that RG should be investigated and documented prior to test equating, especially in low-stakes assessment contexts. A failure to do so may lead to incorrect inferences concerning fairness in equating.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 7","pages":"589-604"},"PeriodicalIF":1.0,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9483216/pdf/10.1177_01466216221108991.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33466450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multistage Testing in Heterogeneous Populations: Some Design and Implementation Considerations.","authors":"Leslie Rutkowski, Yuan-Ling Liaw, Dubravka Svetina, David Rutkowski","doi":"10.1177/01466216221108123","DOIUrl":"https://doi.org/10.1177/01466216221108123","url":null,"abstract":"<p><p>A central challenge in international large-scale assessments is adequately measuring dozens of highly heterogeneous populations, many of which are low performers. To that end, multistage adaptive testing offers one possibility for better assessing across the achievement continuum. This study examines the way that several multistage test design and implementation choices can impact measurement performance in this setting. To attend to gaps in the knowledge base, we extended previous research to include multiple, linked panels, more appropriate estimates of achievement, and multiple populations of varied proficiency. Including achievement distributions from varied populations and associated item parameters, we design and execute a simulation study that mimics an established international assessment. We compare several routing schemes and varied module lengths in terms of item and person parameter recovery. Our findings suggest that, particularly for low performing populations, multistage testing offers precision advantages. Further, findings indicate that equal module lengths (desirable for controlling position effects) and classical routing methods, which lower the technological burden of implementing such a design, produce good results. Finally, probabilistic misrouting offers advantages over merit routing for controlling bias in item and person parameters. Overall, multistage testing shows promise for extending the scope of international assessments. We discuss the importance of our findings for operational work in the international assessment domain.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":"46 6","pages":"494-508"},"PeriodicalIF":1.2,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9382094/pdf/10.1177_01466216221108123.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10189453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}