{"title":"Maintaining Score Scales Over Time: A Comparison of Five Scoring Methods","authors":"S. Y. Kim, Won‐Chan Lee","doi":"10.1080/08957347.2023.2172015","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172015","url":null,"abstract":"ABSTRACT This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of linking with multiple test forms. Simulation factors included 1) the number of forms linked back to the initial form, 2) the pattern in mean shift, and 3) the proportion of common items. Results showed that scoring methods that operate with number-correct scores generally outperform those that are based on IRT proficiency estimators (θ) in terms of reproducing the mean and standard deviation of scale scores. Scoring methods also performed differently as a function of the pattern of group proficiency change.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"60 - 79"},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46970807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accuracy and Sensitivity of Coefficient Alpha and Its Alternatives with Unidimensional and Contaminated Scales","authors":"Leifeng Xiao, K. Hau","doi":"10.1080/08957347.2023.2172016","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172016","url":null,"abstract":"ABSTRACT We compared coefficient alpha with five alternatives (omega total, omega RT, omega h, GLB, and coefficient H) in two simulation studies. Results showed that for unidimensional scales, (a) all indices except omega h performed similarly well under most conditions; (b) alpha remained a good choice; (c) GLB and coefficient H overestimated reliability with small samples and short scales; and (d) sensitivity to scale quality decreased with longer scales. For contaminated scales, (a) all indices except omega h were reasonably unbiased with non-severe contamination; (b) alpha, omega total, and GLB were more sensitive in picking up contamination with shorter scales, whereas omega RT and omega h were not; and (c) coefficient H could not pick up contaminated items among high-quality items. For applied researchers, (a) supplementary information on scale characteristics helps in choosing the appropriate index; (b) comparing different scales against a single gold standard is inappropriate; and (c) omega h should not be used alone.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"31 - 44"},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48520089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Bayesian Networks for Cognitive Assessment of Student Understanding of Buoyancy: A Granular Hierarchy Model","authors":"L. Wang, Sun Xiao Jian, Yan Lou Liu, Tao Xin","doi":"10.1080/08957347.2023.2172014","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172014","url":null,"abstract":"ABSTRACT Cognitive diagnostic assessment based on Bayesian networks (BN) is developed in this paper to evaluate student understanding of the physical concept of buoyancy. We propose a three-order granular-hierarchy BN model that accounts for both fine-grained attributes and high-level proficiencies. Conditional independence in the BN structure is tested and utilized to validate the proposed model. The proficiency relationships are verified and the initial Q-matrix is refined. Then, an optimized granular-hierarchy model is constructed based on the updated Q-matrix. All variants of the constructed models are evaluated on the basis of prediction accuracy and goodness-of-fit tests. The experimental results demonstrate that the optimized granular-hierarchy model has the best prediction and model-fitting performance. In general, the BN method not only provides a more flexible modeling approach but also helps validate or refine the proficiency model and the Q-matrix, giving it a unique advantage in cognitive diagnosis.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"45 - 59"},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49350798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Are Large Admissions Test Coaching Effects Widespread? A Longitudinal Analysis of Admissions Test Scores","authors":"Jeffrey A. Dahlke, P. Sackett, N. Kuncel","doi":"10.1080/08957347.2023.2172018","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172018","url":null,"abstract":"ABSTRACT We examine longitudinal data from 120,384 students who took a version of the PSAT/SAT in the 9th, 10th, 11th, and 12th grades. We investigate score changes over time and show that socioeconomic status (SES) is related to the degree of score improvement. We note that the 9th and 10th grade PSAT are low-stakes tests, while the operational SAT is a high-stakes test. We posit that investments in coaching would be uncommon for early PSAT administrations, and would be concentrated on efforts to prepare for the operational SAT. We compare score improvements between 9th and 10th grade with improvements between 10th and 12th grade, examining results separately by level of SES. We find similar levels of score improvement in low-stakes and high-stakes settings, with 3.4% of high-SES and 1.1% of low-SES students showing larger-than-expected score improvements, which is inconsistent with claims that high-SES students have routine access to highly effective coaching.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"1 - 13"},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42421288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dissecting knowledge, guessing, and blunder in multiple choice assessments.","authors":"Rashid M Abu-Ghazalah, David N Dubins, Gregory M K Poon","doi":"10.1080/08957347.2023.2172017","DOIUrl":"10.1080/08957347.2023.2172017","url":null,"abstract":"<p><p>Multiple choice results are inherently probabilistic outcomes, as correct responses reflect a combination of knowledge and guessing, while incorrect responses additionally reflect blunder, a confidently committed mistake. To objectively resolve knowledge from responses in an MC test structure, we evaluated probabilistic models that explicitly account for guessing, knowledge and blunder using eight assessments (>9,000 responses) from an undergraduate biotechnology curriculum. A Bayesian implementation of the models, aimed at assessing their robustness to prior beliefs in examinee knowledge, showed that explicit estimators of knowledge are markedly sensitive to prior beliefs with scores as sole input. To overcome this limitation, we examined self-ranked confidence as a proxy knowledge indicator. For our test set, three levels of confidence resolved test performance. Responses rated as least confident were correct more frequently than expected from random selection, reflecting partial knowledge, but were balanced by blunder among the most confident responses. By translating evidence-based guessing and blunder rates to pass marks that statistically qualify a desired level of examinee knowledge, our approach finds practical utility in test analysis and design.</p>","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"36 1","pages":"80-98"},"PeriodicalIF":1.1,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10201919/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9522330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personality Aspects and the Underprediction of Women’s Academic Performance","authors":"You Zhou, P. Sackett, Thomas Brothen","doi":"10.1080/08957347.2022.2155652","DOIUrl":"https://doi.org/10.1080/08957347.2022.2155652","url":null,"abstract":"ABSTRACT We sought to replicate prior findings that admissions tests’ underprediction of female college performance was driven in part by the omission of Big 5 personality factors from the predictive model, using 5,400 college students. We investigated gender differences in an elaborated model subdividing the Big 5 into ten aspects. We found differences at the aspect level that were not found at the factor level, and some aspects had unique relationships with academic outcomes. The findings demonstrated the effect of omitted variables on predictive bias.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"35 1","pages":"287 - 299"},"PeriodicalIF":1.5,"publicationDate":"2022-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46522162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Examination of Individual Ability Estimation and Classification Accuracy Under Rapid Guessing Misidentifications","authors":"Joseph A. Rios","doi":"10.1080/08957347.2022.2155653","DOIUrl":"https://doi.org/10.1080/08957347.2022.2155653","url":null,"abstract":"ABSTRACT To mitigate the deleterious effects of rapid guessing (RG) on ability estimates, several rescoring procedures have been proposed. Underlying many of these procedures is the assumption that RG is accurately identified. At present, there have been minimal investigations examining the utility of rescoring approaches when RG is misclassified and individual scores are reported. To address this limitation, the present simulation study investigates the effect of RG misclassifications on individual examinee ability estimate bias and classification accuracy when using effort-moderated (EM) scoring. This objective is accomplished by manipulating simulee ability level, RG rate, as well as misclassification type and percentage. Results showed that EM scoring significantly improved ability inferences for examinees engaging in RG; however, the effectiveness of this approach was largely dependent on misclassification type. Specifically, across ability levels, bias tended to be lower on average when falsely classifying effortful responses as RG. Although EM scoring improved bias, it was susceptible to elevated false-positive classifications of ability under high RG.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"35 1","pages":"300 - 312"},"PeriodicalIF":1.5,"publicationDate":"2022-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42107151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of Methods for Identifying Differential Step Functioning with Polytomous Item Response Data","authors":"Holmes W. Finch","doi":"10.1080/08957347.2022.2155650","DOIUrl":"https://doi.org/10.1080/08957347.2022.2155650","url":null,"abstract":"ABSTRACT Much research has been devoted to identification of differential item functioning (DIF), which occurs when the item responses for individuals from two groups differ after they are conditioned on the latent trait being measured by the scale. There has been less work examining differential step functioning (DSF), which is present for polytomous items when the conditional likelihood of responses to specific categories differs between groups. DSF impacts estimation of the measured trait and reduces the effectiveness of standard DIF detection methods. The purpose of this simulation study was to extend upon earlier work by comparing several methods for detecting the presence of DSF in polytomous items, including an approach based on lasso estimation of the generalized partial credit model. Results show that the lasso GPCM technique controlled the Type I error rate while yielding power rates somewhat lower than logistic regression and the MIMIC model, which were not able to control the Type I error rate in some conditions. An empirical example is also presented, and implications of this study for practice are discussed.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"35 1","pages":"255 - 271"},"PeriodicalIF":1.5,"publicationDate":"2022-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47299711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Decline as an Indicator of Generalized Test-Taking Disengagement","authors":"S. Wise, G. Kingsbury","doi":"10.1080/08957347.2022.2155651","DOIUrl":"https://doi.org/10.1080/08957347.2022.2155651","url":null,"abstract":"ABSTRACT In achievement testing we assume that students will demonstrate their maximum performance as they encounter test items. Sometimes, however, student performance can decline during a test event, which implies that the test score does not represent maximum performance. This study describes a method for identifying significant performance decline and investigated its utility as an indicator of generalized test-taking disengagement. Analysis of data from a computerized adaptive interim achievement test showed that performance decline classifications exhibited characteristics similar to those from disengagement classifications based on rapid guessing. More importantly, performance decline was found to identify disengagement by many students who would not have been identified as disengaged based on rapid-guessing behavior.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"35 1","pages":"272 - 286"},"PeriodicalIF":1.5,"publicationDate":"2022-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42114164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When Should Individual Ability Estimates Be Reported if Rapid Guessing Is Present?","authors":"Joseph A. Rios","doi":"10.1080/08957347.2022.2103138","DOIUrl":"https://doi.org/10.1080/08957347.2022.2103138","url":null,"abstract":"<p><b>ABSTRACT</b></p><p>Testing programs are confronted with the decision of whether to report individual scores for examinees that have engaged in rapid guessing (RG). As noted by the <i>Standards for Educational and Psychological Testing</i>, this decision should be based on a documented criterion that determines score exclusion. To this end, a number of heuristic criteria (e.g., exclude all examinees with RG rates of 10%) have been adopted in the literature. Given that these criteria lack strong methodological support, the objective of this simulation study was to evaluate their appropriateness in terms of individual ability estimate and classification accuracy when manipulating both assessment and RG characteristics. The findings provide evidence that employing a common criterion for all examinees may be an ineffective strategy because a given RG percentage may have differing degrees of biasing effects based on test difficulty, examinee ability, and RG pattern. These results suggest that practitioners may benefit from establishing context-specific exclusion criteria that consider test purpose, score use, and targeted examinee trait levels.</p>","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"65 11","pages":""},"PeriodicalIF":1.5,"publicationDate":"2022-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138495008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}