{"title":"Predictive Modeling of Rater Behavior: Implications for Quality Assurance in Essay Scoring","authors":"I. Bejar, Chen Li, D. McCaffrey","doi":"10.1080/08957347.2020.1750406","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750406","url":null,"abstract":"ABSTRACT We evaluate the feasibility of developing predictive models of rater behavior, that is, rater-specific models for predicting the scores produced by a rater under operational conditions. In the present study, the dependent variable is the score assigned to essays by a rater, and the predictors are linguistic attributes of the essays used by the e-rater® engine. Specifically, for each rater, the linear regression of rater scores on the linguistic attributes is obtained based on data from two consecutive time periods. The regression from each period was cross validated against data from the other period. Raters were characterized in terms of their level of predictability and the importance of the predictors. Results suggest that rater models capture stable individual differences among raters. To evaluate the feasibility of using rater models as a quality control mechanism, we evaluated the relationship between rater predictability and inter-rater agreement and performance on pre-scored essays. Finally, we conducted a simulation whereby raters are simulated to score exclusively as a function of essay length at different points during the scoring day. We concluded that predictive rater models merit further investigation as a means of quality controlling human scoring.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"234 - 247"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750406","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45042742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying Cognitive Theory to the Human Essay Rating Process","authors":"B. Finn, Burcu Arslan, M. Walsh","doi":"10.1080/08957347.2020.1750405","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750405","url":null,"abstract":"ABSTRACT To score an essay response, raters draw on previously trained skills and knowledge about the underlying rubric and score criterion. Cognitive processes such as remembering, forgetting, and skill decay likely influence rater performance. To investigate how forgetting influences scoring, we evaluated raters’ scoring accuracy on TOEFL and GRE essays. We used binomial linear mixed effect models to evaluate how the effect of various predictors such as time spent scoring each response and days between scoring sessions relate to scoring accuracy. Results suggest that for both nonoperational (i.e., calibration samples completed prior to a scoring session) and operational scoring (i.e., validity samples interspersed among actual student responses), the number of days in a scoring gap negatively affects performance. The findings, as well as other results from the models are discussed in the context of cognitive influences on knowledge and skill retention.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"223 - 233"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750405","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46828655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gauging Uncertainty in Test-to-Curriculum Alignment Indices","authors":"A. Traynor, Tingxuan Li, Shuqi Zhou","doi":"10.1080/08957347.2020.1732387","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732387","url":null,"abstract":"ABSTRACT During the development of large-scale school achievement tests, panels of independent subject-matter experts use systematic judgmental methods to rate the correspondence between a given test’s items and performance objective statements. The individual experts’ ratings may then be used to compute summary indices to quantify the match between a given test and its target item domain. The magnitude of alignment index variability across experts within a panel, and randomly-sampled panels, is largely unknown, however. Using rater-by-item data from alignment reviews of 14 US states’ achievement tests, we examine observed distributions and estimate standard errors for three alignment indices developed by Webb. Our results suggest that alignment decisions based on the recommended criterion for the balance-of-representation index may often be uncertain, and that the criterion for the depth-of-knowledge consistency index should perhaps be reconsidered. We also examine current recommendations about the number of expert panelists required to compute these alignment indices.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"141 - 158"},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732387","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49412039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Test-Taking Disengagement on Item Content Representation","authors":"S. Wise","doi":"10.1080/08957347.2020.1732386","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732386","url":null,"abstract":"ABSTRACT In achievement testing there is typically a practical requirement that the set of items administered should be representative of some target content domain. This is accomplished by establishing test blueprints specifying the content constraints to be followed when selecting the items for a test. Sometimes, however, students give disengaged responses to some of their test items, which raises the issue of the degree to which the set of engaged responses maintain the intended content representation. The current investigation reports the results of two studies focused on rapid-guessing behavior. The first study showed evidence that differential rapid guessing often resulted in test events with meaningfully distorted content representation. The second study found that the differences in test taking engagement across content categories were primarily due to differences in the reading load of items. Implications for test-score validity are discussed along with suggestions for addressing the problem.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"83 - 94"},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732386","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41488342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Trade-Off between Model Fit, Invariance, and Validity: The Case of PISA Science Assessments","authors":"Yasmine H. El Masri, D. Andrich","doi":"10.1080/08957347.2020.1732384","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732384","url":null,"abstract":"ABSTRACT In large-scale educational assessments, it is generally required that tests are composed of items that function invariantly across the groups to be compared. Despite efforts to ensure invariance in the item construction phase, for a range of reasons (including the security of items) it is often necessary to account for differential item functioning (DIF) of items post hoc. This typically requires a choice among retaining an item as it is despite its DIF, deleting the item, or resolving (splitting) an item by creating a distinct item for each group. These options involve a trade-off between model fit and the invariance of item parameters, and each option could be valid depending on whether or not the source of DIF is relevant or irrelevant to the variable being assessed. We argue that making a choice requires a careful analysis of statistical DIF and its substantive source. We illustrate our argument by analyzing PISA 2006 science data of three countries (UK, France and Jordan) using the Rasch model, which was the model used for the analyses of all PISA 2006 data. We identify items with real DIF across countries and examine the implications for model fit, invariance, and the validity of cross-country comparisons when these items are either eliminated, resolved or retained.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"174 - 188"},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732384","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43277464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing Cut Scores from the Angoff Method and Two Variations of the Hofstee and Beuk Methods","authors":"Adam E. Wyse","doi":"10.1080/08957347.2020.1732385","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732385","url":null,"abstract":"ABSTRACT This article compares cut scores from two variations of the Hofstee and Beuk methods, which determine cut scores by resolving inconsistencies in panelists’ judgments about cut scores and pass rates, with the Angoff method. The first variation uses responses to the Hofstee and Beuk percentage correct and pass rate questions to calculate cut scores. The second variation uses Angoff ratings to determine percentage correct data in combination with responses to pass rate questions. Analysis of data from 15 standard settings suggested that the Hofstee and Beuk methods yielded similar cut scores, and that cut scores were about 2% lower when using Angoff ratings. The two approaches also differed in the weight assigned to cut score judgments in the Beuk method and in the occurrence of undefined cut scores in the Hofstee method. Findings also indicated that the Hofstee and Beuk methods often produced higher cut scores and lower pass rates than the Angoff method. It is suggested that attention needs to be paid to the strategy used to estimate Hofstee and Beuk cut scores.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"159 - 173"},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732385","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49228294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rasch Model Extensions for Enhanced Formative Assessments in MOOCs","authors":"D. Abbakumov, P. Desmet, W. Van den Noortgate","doi":"10.1080/08957347.2020.1732382","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732382","url":null,"abstract":"ABSTRACT Formative assessments are an important component of massive open online courses (MOOCs), online courses with open access and unlimited student participation. Accurate conclusions on students’ proficiency via formative, however, face several challenges: (a) students are typically allowed to make several attempts; and (b) student performance might be affected by other variables, such as interest. Thus, neglecting the effects of attempts and interest in proficiency evaluation might result in biased conclusions. In this study, we try to solve this limitation and propose two extensions of the common psychometric model, the Rasch model, by including the effects of attempts and interest. We illustrate these extensions using real MOOC data and evaluate them using cross-validation. We found that (a) the effects of attempts and interest on the performance are positive on average but both vary among students; (b) a part of the variance in proficiency parameters is due to variation between students in the effect of interest; and (c) the overall accuracy of prediction of student’s item responses using the extensions is 4.3% higher than using the Rasch model.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"113 - 123"},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732382","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44842343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subscore Equating and Profile Reporting","authors":"Euijin Lim, Won‐Chan Lee","doi":"10.1080/08957347.2020.1732381","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732381","url":null,"abstract":"ABSTRACT The purpose of this study is to address the necessity of subscore equating and to evaluate the performance of various equating methods for subtests. Assuming the random groups design and number-correct scoring, this paper analyzed real data and simulated data with four study factors including test dimensionality, subtest length, form difference in difficulty, and sample size. The results indicated that reporting subscores without equating provides misleading information in terms of score profiles and that reporting subscores without a pre-specified test specification brings practical issues such as constructing alternate subtest forms with comparable difficulty, conducting equating between forms with different lengths, and deciding an appropriate score scale to be reported.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"95 - 112"},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732381","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47621779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Effectiveness and Features of Formative Assessment in US K-12 Education: A Systematic Review","authors":"Hansol Lee, Huy Q. Chung, Yu Zhang, J. Abedi, M. Warschauer","doi":"10.1080/08957347.2020.1732383","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732383","url":null,"abstract":"ABSTRACT In the present article, we present a systematical review of previous empirical studies that conducted formative assessment interventions to improve student learning. Previous meta-analysis research on the overall effects of formative assessment on student learning has been conclusive, but little has been studied on important features of formative assessment interventions and their differential impacts on student learning in the United States’ K-12 education system. Analysis of the identified 126 effect sizes from the selected 33 studies representing 25 research projects that met the inclusion criteria (e.g., included a control condition) revealed an overall small-sized positive effect of formative assessment on student learning (d = .29) with benefits for mathematics (d = .34), literacy (d = .33), and arts (d = .29). Further investigation with meta-regression analyses indicated that supporting student-initiated self-assessment (d = .61) and providing formal formative assessment evidence (e.g., written feedback on quizzes; d = .40) via a medium-cycle length (within or between instructional units; d = .52) were found to enhance the effectiveness of formative assessments.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"124 - 140"},"PeriodicalIF":1.5,"publicationDate":"2020-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732383","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42432168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Some Methods and Evaluation for Linking and Equating with Small Samples","authors":"Michael R. Peabody","doi":"10.1080/08957347.2019.1674304","DOIUrl":"https://doi.org/10.1080/08957347.2019.1674304","url":null,"abstract":"ABSTRACT The purpose of the current article is to introduce the equating and evaluation methods used in this special issue. Although a comprehensive review of all existing models and methodologies would be impractical given the format, a brief introduction to some of the more popular models will be provided. A brief discussion of the conditions required for equating precedes the discussion of the equating methods themselves. The procedures in this review include the Tucker method, mean equating, nominal weights mean, simplified circle arc, identity equating, and IRT/Rasch model equating. Models shown that help to evaluate the success of the equating process are the standard error of equating, bias, and root-mean-square error. This should provide readers with a basic framework and enough background information to follow the studies found in this issue.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"3 - 9"},"PeriodicalIF":1.5,"publicationDate":"2020-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2019.1674304","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48203381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}