{"title":"Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems","authors":"W. Jake Thompson, Brooke Nash, Amy K. Clark, Jeffrey C. Hoover","doi":"10.1111/jedm.12359","DOIUrl":"10.1111/jedm.12359","url":null,"abstract":"<p>As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a method for simulating retests to summarize reliability evidence at multiple reporting levels. We evaluate how the performance of reliability estimates from simulated retests compares to other measures of classification consistency and accuracy for diagnostic assessments that have previously been described in the literature, but which limit the level at which reliability can be reported. Overall, the findings show that reliability estimates from simulated retests are an accurate measure of reliability and are consistent with other measures of reliability for diagnostic assessments. We then apply this method to real data from the Examination for the Certificate of Proficiency in English to demonstrate the method in practice and compare reliability estimates from observed data. Finally, we discuss implications for the field and possible next directions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"455-475"},"PeriodicalIF":1.3,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47801652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Exploration of an Improved Aggregate Student Growth Measure Using Data from Two States","authors":"Katherine E. Castellano, Daniel F. McCaffrey, J. R. Lockwood","doi":"10.1111/jedm.12354","DOIUrl":"10.1111/jedm.12354","url":null,"abstract":"<p>The simple average of student growth scores is often used in accountability systems, but it can be problematic for decision making. When computed using a small/moderate number of students, it can be sensitive to the sample, resulting in inaccurate representations of growth of the students, low year-to-year stability, and inequities for low-incidence groups. An alternative designed to address these issues is to use an Empirical Best Linear Prediction (EBLP), which is a weighted average of growth score data from other years and/or subjects. We apply both approaches to two statewide datasets to answer empirical questions about their performance. The EBLP outperforms the simple average in accuracy and cross-year stability with the exception that accuracy was not necessarily improved for very large districts in one of the states. In such exceptions, we show a beneficial alternative may be to use a hybrid approach in which very large districts receive the simple average and all others receive the EBLP. We find that adding more growth score data to the computation of the EBLP can improve accuracy, but not necessarily for larger schools/districts. We review key decision points in aggregate growth reporting and in specifying an EBLP weighted average in practice.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 2","pages":"173-201"},"PeriodicalIF":1.3,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41556588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classification Accuracy and Consistency of Compensatory Composite Test Scores","authors":"J. Carl Setzer, Ying Cheng, Cheng Liu","doi":"10.1111/jedm.12357","DOIUrl":"10.1111/jedm.12357","url":null,"abstract":"<p>Test scores are often used to make decisions about examinees, such as in licensure and certification testing, as well as in many educational contexts. In some cases, these decisions are based upon compensatory scores, such as those from multiple sections or components of an exam. Classification accuracy and classification consistency are two psychometric characteristics of test scores that are often reported when decisions are based on those scores, and several techniques currently exist for estimating both accuracy and consistency. However, research on classification accuracy and consistency on compensatory test scores is scarce. This study demonstrates two techniques that can be used to estimate classification accuracy and consistency when test scores are used in a compensatory manner. First, a simulation study demonstrates that both methods provide very similar results under the studied conditions. Second, we demonstrate how the two methods could be used with a high-stakes licensure exam.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"501-519"},"PeriodicalIF":1.3,"publicationDate":"2023-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46918652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Specifying the Three Ws in Educational Measurement: Who Uses Which Scores for What Purpose?","authors":"Andrew Ho","doi":"10.1111/jedm.12355","DOIUrl":"10.1111/jedm.12355","url":null,"abstract":"<p>I argue that understanding and improving educational measurement requires specificity about actors, scores, and purpose: Who uses which scores for what purpose? I show how this specificity complements Briggs’ frameworks for educational measurement that he presented in his 2022 address as president of the National Council on Measurement in Education.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"59 4","pages":"418-422"},"PeriodicalIF":1.3,"publicationDate":"2022-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48187875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Calibration in Multidimensional Computerized Adaptive Testing with Polytomously Scored Items","authors":"Lu Yuan, Yingshi Huang, Shuhang Li, Ping Chen","doi":"10.1111/jedm.12353","DOIUrl":"10.1111/jedm.12353","url":null,"abstract":"<p>Online calibration is a key technology for item calibration in computerized adaptive testing (CAT) and has been widely used in various forms of CAT, including unidimensional CAT, multidimensional CAT (MCAT), CAT with polytomously scored items, and cognitive diagnostic CAT. However, as multidimensional and polytomous assessment data become more common, only a few published reports focus on online calibration in MCAT with polytomously scored items (P-MCAT). Therefore, standing on the shoulders of the existing online calibration methods/designs, this study proposes four new P-MCAT online calibration methods and two new P-MCAT online calibration designs and conducts two simulation studies to evaluate their performance under varying conditions (i.e., different calibration sample sizes and correlations between dimensions). Results show that all of the newly proposed methods can accurately recover item parameters, and the adaptive designs outperform the random design in most cases. In the end, this paper provides practical guidance based on simulation results.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 3","pages":"476-500"},"PeriodicalIF":1.3,"publicationDate":"2022-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47208290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring the Uncertainty of Imputed Scores","authors":"Sandip Sinharay","doi":"10.1111/jedm.12352","DOIUrl":"10.1111/jedm.12352","url":null,"abstract":"<p>Technical difficulties and other unforeseen events occasionally lead to incomplete data on educational tests, which necessitates the reporting of imputed scores to some examinees. While there exist several approaches for reporting imputed scores, there is a lack of any guidance on the reporting of the uncertainty of imputed scores. In this paper, several approaches are suggested for quantifying the uncertainty of imputed scores using measures that are similar in spirit to estimates of reliability and standard error of measurement. A simulation study is performed to examine the properties of the approaches. The approaches are then applied to data from a state test on which some examinees' scores had to be imputed following computer problems. Several recommendations are made for practice.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 2","pages":"351-375"},"PeriodicalIF":1.3,"publicationDate":"2022-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45116305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Exponentially Weighted Moving Average Procedure for Detecting Back Random Responding Behavior","authors":"Yinhong He","doi":"10.1111/jedm.12351","DOIUrl":"10.1111/jedm.12351","url":null,"abstract":"<p>Back random responding (BRR) behavior is one of the commonly observed careless response behaviors. Accurately detecting BRR behavior can improve test validities. Yu and Cheng (2019) showed that the change point analysis (CPA) procedure based on weighted residual (CPA-WR) performed well in detecting BRR. Compared with the CPA procedure, the exponentially weighted moving average (EWMA) obtains more detailed information. This study equipped the weighted residual statistic with EWMA, and proposed the EWMA-WR method to detect BRR. To make the critical values adaptive to the ability levels, this study proposed the Monte Carlo simulation with ability stratification (MC-stratification) method for calculating critical values. Compared to the original Monte Carlo simulation (MC) method, the newly proposed MC-stratification method generated a larger number of satisfactory results. The performances of CPA-WR and EWMA-WR were evaluated under different conditions that varied in the test lengths, abnormal proportions, critical values and smoothing constants used in the EWMA-WR method. The results showed that EWMA-WR was more powerful than CPA-WR in detecting BRR. Moreover, an empirical study was conducted to illustrate the utility of EWMA-WR for detecting BRR.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 2","pages":"282-317"},"PeriodicalIF":1.3,"publicationDate":"2022-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47390314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple-Group Joint Modeling of Item Responses, Response Times, and Action Counts with the Conway-Maxwell-Poisson Distribution","authors":"Xin Qiao, Hong Jiao, Qiwei He","doi":"10.1111/jedm.12349","DOIUrl":"10.1111/jedm.12349","url":null,"abstract":"<p>Multiple group modeling is one of the methods to address the measurement noninvariance issue. Traditional studies on multiple group modeling have mainly focused on item responses. In computer-based assessments, joint modeling of response times and action counts with item responses helps estimate the latent speed and action levels in addition to latent ability. These two new data sources can also be used to further address the measurement noninvariance issue. One challenge, however, is to correctly model action counts which can be underdispersed, overdispersed, or equidispersed in real data sets. To address this, we adopted the Conway-Maxwell-Poisson distribution that accounts for different types of dispersion in action counts and incorporated it in the multiple group joint modeling of item responses, response times, and action counts. Bayesian Markov Chain Monte Carlo method was used for model parameter estimation. To illustrate an application of the proposed model, an empirical data analysis was conducted using the Programme for International Student Assessment (PISA) 2015 collaborative problem-solving items where potential measurement noninvariance issue existed between gender groups. Results indicated that Conway-Maxwell-Poisson model yielded better model fit than alternative count data models such as negative binomial and Poisson models. In addition, response times and action counts provided further information on performance differences between groups.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"60 2","pages":"255-281"},"PeriodicalIF":1.3,"publicationDate":"2022-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45484845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NCME Presidential Address 2022: Turning the Page to the Next Chapter of Educational Measurement","authors":"Derek C. Briggs","doi":"10.1111/jedm.12350","DOIUrl":"https://doi.org/10.1111/jedm.12350","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"59 4","pages":"398-417"},"PeriodicalIF":1.3,"publicationDate":"2022-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"137813868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}