{"title":"Validity and Racial Justice in Educational Assessment","authors":"Josh Lederman","doi":"10.1080/08957347.2023.2214654","DOIUrl":"https://doi.org/10.1080/08957347.2023.2214654","url":null,"abstract":"Abstract Given the centrality of validity to assessment, until the concept of validity includes concern for racial justice, such matters will be seen as residing outside the “real” work of validation, rendering them powerless to count against the apparent scientific merit of the test. As the definition of validity has evolved, however, it holds great potential to centralize matters like racial (in)justice, positioning them as necessary validity evidence. This article reviews a history of debates over what validity should and shouldn’t encompass; I then look toward the more centralized stances on validity – the Standards and the Educational Measurement book series – where test use, and the social impact of test use, have been a mounting concern over the years within these publications. Finally, I explore Kane’s argument-based approach to validation, which I argue could impact racial justice concerns by centralizing them within the very notion of what makes assessment valid or invalid.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41535738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"College Admissions and Testing in a Time of Transformational Change","authors":"Ross E. Markle","doi":"10.1080/08957347.2023.2201705","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201705","url":null,"abstract":"conversation","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42510663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing Drift Detection Methods for Accurate Rasch Equating in Different Sample Sizes","authors":"Sarah Alahmadi, Andrew T. Jones, Carol L. Barry, Beatriz Ibáñez","doi":"10.1080/08957347.2023.2201704","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201704","url":null,"abstract":"ABSTRACT Rasch common-item equating is often used in high-stakes testing to maintain equivalent passing standards across test administrations. If unaddressed, item parameter drift poses a major threat to the accuracy of Rasch common-item equating. We compared the performance of well-established and newly developed drift detection methods in small and large sample sizes, varying the proportion of test items used as anchor (common) items and the proportion of drifted anchors. In the simulated-data study, the most accurate equating was obtained in large-sample conditions with a small-moderate number of drifted anchors using the mINFIT/mOUTFIT methods. However, when any drift was present in small-sample conditions and when a large number of drifted anchors were present in large-sample conditions, all methods performed ineffectively. In the operational-data study, percent-correct standards and failure rates varied across the methods in the large-sample exam but not in the small-sample exam. Different recommendations for high- and low-volume testing programs are provided.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42571395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keeping Up the PACE: Evaluating Grade 8 Student Achievement Outcomes for New Hampshire’s Innovative Assessment System","authors":"Alexandra Lane Perez, Carla M. Evans","doi":"10.1080/08957347.2023.2201700","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201700","url":null,"abstract":"ABSTRACT New Hampshire’s Performance Assessment of Competency Education (PACE) innovative assessment system uses student scores from classroom performance assessments as well as other classroom tests for school accountability purposes. One concern is that not having annual state testing may incentivize schools and teachers away from teaching the breadth of the state content standards. This study examined the effects of PACE on Grade 8 test scores after 5 years of implementation using propensity score matching followed by hierarchical linear modeling. The results suggest that PACE students perform about the same, on average, in mathematics and ELA as non-PACE students on the state assessment. There was no evidence of differential effects for students who had an individualized education program or were eligible for free or reduced-price lunch (FRL). Findings for this limited sample suggest schools and teachers did not sacrifice the breadth of students’ opportunity to learn the state content standards while piloting a state performance assessment reform.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48459890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Group Generalizations of SIBTEST and Crossing-SIBTEST","authors":"R. P. Chalmers, Guoguo Zheng","doi":"10.1080/08957347.2023.2201703","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201703","url":null,"abstract":"ABSTRACT This article presents generalizations of SIBTEST and crossing-SIBTEST statistics for differential item functioning (DIF) investigations involving more than two groups. After reviewing the original two-group setup for these statistics, a set of multigroup generalizations that support contrast matrices for joint tests of DIF are presented. To investigate the Type I error and power behavior of these generalizations, a Monte Carlo simulation study was then explored. Results indicated that the proposed generalizations are reasonably effective at recovering their respective population parameter definitions, maintain optimal Type I error control, have suitable power to detect uniform and non-uniform DIF, and in shorter tests are competitive with the generalized logistic regression and generalized Mantel–Haenszel tests for DIF.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44337447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking Ordinal Development of Skills with a Longitudinal DINA Model with Polytomous Attributes","authors":"P. Zhan, Yao-sen Liu, Zhaohui Yu, Yanfang Pan","doi":"10.1080/08957347.2023.2201702","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201702","url":null,"abstract":"ABSTRACT Many educational and psychological studies have shown that the development of students generally proceeds step by step (i.e. ordinal development) to a specific level. This study proposed a novel longitudinal learning diagnosis model with polytomous attributes to track students’ ordinal development in learning. Using the concept of polytomous attributes in the proposed model, the learning process of a specific skill, from non-mastery to mastery, can be divided into multiple ordinal steps in order to better characterize the learning trajectory. The results of an empirical study conducted to explore the performance of the proposed model indicated that it could adequately diagnose the ordinal development of skills in longitudinal assessments. A simulation study was also conducted to examine the estimation accuracy of general ability and the classification accuracy of attributes of the proposed model in different simulated conditions.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47235911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measurement Invariance in Relation to First Language: An Evaluation of German Reading and Spelling Tests","authors":"L. Visser, Friederike Cartschau, Ariane von Goldammer, Janin Brandenburg, M. Timmerman, M. Hasselhorn, C. Mähler","doi":"10.1080/08957347.2023.2201701","DOIUrl":"https://doi.org/10.1080/08957347.2023.2201701","url":null,"abstract":"ABSTRACT The growing number of children in primary schools in Germany who have German as their second language (L2) has raised questions about the fairness of performance assessment. Fair tests are a prerequisite for distinguishing between L2 learning delay and a specific learning disability. We evaluated five commonly used reading and spelling tests for measurement invariance (MI) as a function of first language (German vs. other). Multi-group confirmatory factor analyses revealed strict MI for the Weingarten Basic Vocabulary Spelling Tests (WRTs) 3+ and 4+ and the Salzburger Reading (SLT) and Spelling (SRT) Tests, suggesting these instruments are suitable for assessing reading and spelling skills regardless of first language. The MI for A Reading Comprehension Test for First to Seventh Graders – 2nd Edition (ELFE II) was partly strict with unequal intercepts for the text subscale. We discuss the implications of this finding for assessing the reading performance of children with L2.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"59806259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Census-Level, Multi-Grade Analysis of the Association Between Testing Time, Breaks, and Achievement","authors":"David Rutkowski, Leslie Rutkowski, Dubravka Svetina Valdivia, Yusuf Canbolat, Stephanie Underhill","doi":"10.1080/08957347.2023.2172019","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172019","url":null,"abstract":"ABSTRACT Several states in the US have removed time limits on their state assessments. In Indiana, where this study takes place, the state assessment is both untimed during the testing window and allows unlimited breaks during the testing session. Using grade 3 and 8 math and English state assessment data, in this paper we focus on time used for testing and examine whether students who take more time tend to outperform their peers. Further, we also examine whether the number of breaks students take is associated with student achievement scores. Findings suggest that even in an untimed setting, there remains a strong association between time spent on the assessment and achievement at both the student and school level. The number of breaks, on the other hand, shows little to no association with achievement after controlling for time. The paper concludes with a discussion of the policy implications of the findings.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41397365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maintaining Score Scales Over Time: A Comparison of Five Scoring Methods","authors":"S. Y. Kim, Won‐Chan Lee","doi":"10.1080/08957347.2023.2172015","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172015","url":null,"abstract":"ABSTRACT This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of linking with multiple test forms. Simulation factors included 1) the number of forms linked back to the initial form, 2) the pattern in mean shift, and 3) the proportion of common items. Results showed that scoring methods that operate with number-correct scores generally outperform those that are based on IRT proficiency estimators in terms of reproducing the mean and standard deviation of scale scores. Scoring methods performed differently as a function of the pattern of group proficiency change.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46970807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accuracy and Sensitivity of Coefficient Alpha and Its Alternatives with Unidimensional and Contaminated Scales","authors":"Leifeng Xiao, K. Hau","doi":"10.1080/08957347.2023.2172016","DOIUrl":"https://doi.org/10.1080/08957347.2023.2172016","url":null,"abstract":"ABSTRACT We compared coefficient alpha with five alternatives (omega total, omega RT, omega h, GLB, and coefficient H) in two simulation studies. Results showed for unidimensional scales, (a) all indices except omega h performed similarly well for most conditions; (b) alpha is still good; (c) GLB and coefficient H overestimated reliability with small samples and short scales, and (d) sensitivity to scale quality decreased with longer scales. For contaminated scales, (a) all indices except omega h were reasonably unbiased with non-severe contamination; (b) alpha, omega total, and GLB were more sensitive in picking up contamination with shorter scales, whereas omega RT and omega h were not; and (c) coefficient H could not pick up contaminated items among high-quality items. For applied researchers, (a) supplementary information of scale characteristics helps choose the appropriate index; (b) comparing different scales with one gold standard is inappropriate; (c) omega h should not be used alone.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48520089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Education","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}