{"title":"Evaluating Random and Systematic Error in Student Growth Percentiles","authors":"C. Wells, S. Sireci","doi":"10.1080/08957347.2020.1789139","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789139","url":null,"abstract":"ABSTRACT Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in SGPs by simulating test scores for four grades and estimating SGPs using one, two, or three conditioning years. The results indicated that, although the amount of systematic error was small to moderate, the amount of random error was substantial, regardless of the number of conditioning years. For example, the standard error of the SGP estimates associated with an SGP value of 56 was 22.2 resulting in a 68% confidence interval that would range from 33.8 to 78.2 when using three conditioning years. The results are consistent with previous research and suggest SGP estimates are too imprecise to be reported for the purpose of understanding students’ progress over time.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789139","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43006041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Setting Scoring Expectations on Rater Scoring Rates and Accuracy","authors":"Cathy L. W. Wendler, Nancy Glazer, B. Bridgeman","doi":"10.1080/08957347.2020.1750401","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750401","url":null,"abstract":"ABSTRACT Efficient constructed response (CR) scoring requires both accuracy and speed from human raters. This study was designed to determine if setting scoring rate expectations would encourage raters to score at a faster pace, and if so, if there would be differential effects on scoring accuracy for raters who score at different rates. Three rater groups (slow, medium, and fast) and two conditions (informed and uninformed) were used. In both conditions, raters were given identical scoring directions, but only the informed groups were given an expected scoring rate. Results indicated no significant differences across the two conditions. However, there were significant increases in scoring rates for medium and slow raters compared to their previous operational rates, regardless of whether they were in the informed or uninformed condition. Results also showed there were no significant effects on rater accuracy for either of the two conditions or for any of the rater groups.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750401","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42360842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and Interpreting Human Scoring","authors":"Nancy Glazer, E. Wolfe","doi":"10.1080/08957347.2020.1750402","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750402","url":null,"abstract":"ABSTRACT This introductory article describes how constructed response scoring is carried out, particularly the rater monitoring processes and illustrates three potential designs for conducting rater monitoring in an operational scoring project. The introduction also presents a framework for interpreting research conducted by those who study the constructed response scoring process. That framework identifies three classifications of inputs (rater characteristics, response content, and rating context) which typically serve as independent variables in constructed response scoring research as well as three primary outcomes (rating quality, rating speed, and rater attitude) which serve as the dependent variables in those studies. Finally, we explain how each of the articles in this issue can be classified according to that framework.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750402","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42557747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Should We Care about Human Raters?","authors":"E. Wolfe, Cathy L. W. Wendler","doi":"10.1080/08957347.2020.1750407","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750407","url":null,"abstract":"For more than a decade, measurement practitioners and researchers have emphasized evaluating, improving, and implementing automated scoring of constructed response (CR) items and tasks. There is go...","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750407","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46471978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Commentary on “Using Human Raters in Constructed Response Scoring: Understanding, Predicting, and Modifying Performance”","authors":"Walter D. Way","doi":"10.1080/08957347.2020.1750408","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750408","url":null,"abstract":"This special issue of AME provides a rich set of articles related to monitoring human scoring of constructed response items. As a starting point for this commentary, is it worth mentioning that the...","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750408","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41452462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Human Scoring Using Generalizability Theory","authors":"Y. Bimpeh, W. Pointer, Ben A. Smith, Liz Harrison","doi":"10.1080/08957347.2020.1750403","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750403","url":null,"abstract":"ABSTRACT Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we apply generalizability theory (G theory) to data from routine monitoring of ratings to derive an estimate for inter-rater reliability. UK examinations use a combination of double or multiple rating for routine monitoring, creating a more complex design that consists of cross-pairing of raters and overlapping of raters for different groups of candidates or items. This sampling design is neither fully crossed nor is it nested. Each double- or multiple-scored item takes a different set of candidates, and the number of sampled candidates per item varies. Therefore, the standard G theory method, and its various forms for estimating inter-rater reliability, cannot be directly applied to the operational data. We propose a method that takes double or multiple rating data as given and analyzes the datasets at the item level in order to obtain more accurate and stable variance component estimates. We adapt the variance component in observed scores for an unbalanced one-facet crossed design with some missing observations. These estimates can be used to make inferences about the reliability of the entire scoring process. We illustrate the proposed method by applying it to real scoring data.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750403","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43345855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Operational Scoring Experience and Additional Mentored Training on Raters’ Essay Scoring Accuracy","authors":"Ikkyu Choi, E. Wolfe","doi":"10.1080/08957347.2020.1750404","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750404","url":null,"abstract":"ABSTRACT Rater training is essential in ensuring the quality of constructed response scoring. Most of the current knowledge about rater training comes from experimental contexts with an emphasis on short-term effects. Few sources are available for empirical evidence on whether and how raters become more accurate as they gain scoring experiences or what long-term effects training can have. In this study, we addressed this research gap by tracking how the accuracies of new raters change through experience and by examining the impact of an additional training session on their accuracies in scoring calibration and monitoring essays. We found that, on average, raters’ accuracy improved with scoring experience and that individual raters differed in their accuracy trajectories. The estimated average effect of the training was an approximately six percent increase in the calibration essay accuracy. On the other hand, we observed a smaller impact on the monitoring essay accuracy. Our follow-up analysis showed that this differential impact of the additional training on the calibration and monitoring essay accuracy could be accounted for by successful gatekeeping through calibration.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750404","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45226677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictive Modeling of Rater Behavior: Implications for Quality Assurance in Essay Scoring","authors":"I. Bejar, Chen Li, D. McCaffrey","doi":"10.1080/08957347.2020.1750406","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750406","url":null,"abstract":"ABSTRACT We evaluate the feasibility of developing predictive models of rater behavior, that is, rater-specific models for predicting the scores produced by a rater under operational conditions. In the present study, the dependent variable is the score assigned to essays by a rater, and the predictors are linguistic attributes of the essays used by the e-rater® engine. Specifically, for each rater, the linear regression of rater scores on the linguistic attributes is obtained based on data from two consecutive time periods. The regression from each period was cross validated against data from the other period. Raters were characterized in terms of their level of predictability and the importance of the predictors. Results suggest that rater models capture stable individual differences among raters. To evaluate the feasibility of using rater models as a quality control mechanism, we evaluated the relationship between rater predictability and inter-rater agreement and performance on pre-scored essays. Finally, we conducted a simulation whereby raters are simulated to score exclusively as a function of essay length at different points during the scoring day. We concluded that predictive rater models merit further investigation as a means of quality controlling human scoring.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750406","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45042742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying Cognitive Theory to the Human Essay Rating Process","authors":"B. Finn, Burcu Arslan, M. Walsh","doi":"10.1080/08957347.2020.1750405","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750405","url":null,"abstract":"ABSTRACT To score an essay response, raters draw on previously trained skills and knowledge about the underlying rubric and score criterion. Cognitive processes such as remembering, forgetting, and skill decay likely influence rater performance. To investigate how forgetting influences scoring, we evaluated raters’ scoring accuracy on TOEFL and GRE essays. We used binomial linear mixed effect models to evaluate how the effect of various predictors such as time spent scoring each response and days between scoring sessions relate to scoring accuracy. Results suggest that for both nonoperational (i.e., calibration samples completed prior to a scoring session) and operational scoring (i.e., validity samples interspersed among actual student responses), the number of days in a scoring gap negatively affects performance. The findings, as well as other results from the models are discussed in the context of cognitive influences on knowledge and skill retention.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750405","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46828655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gauging Uncertainty in Test-to-Curriculum Alignment Indices","authors":"A. Traynor, Tingxuan Li, Shuqi Zhou","doi":"10.1080/08957347.2020.1732387","DOIUrl":"https://doi.org/10.1080/08957347.2020.1732387","url":null,"abstract":"ABSTRACT During the development of large-scale school achievement tests, panels of independent subject-matter experts use systematic judgmental methods to rate the correspondence between a given test’s items and performance objective statements. The individual experts’ ratings may then be used to compute summary indices to quantify the match between a given test and its target item domain. The magnitude of alignment index variability across experts within a panel, and randomly-sampled panels, is largely unknown, however. Using rater-by-item data from alignment reviews of 14 US states’ achievement tests, we examine observed distributions and estimate standard errors for three alignment indices developed by Webb. Our results suggest that alignment decisions based on the recommended criterion for the balance-of-representation index may often be uncertain, and that the criterion for the depth-of-knowledge consistency index should perhaps be reconsidered. We also examine current recommendations about the number of expert panelists required to compute these alignment indices.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2020-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1732387","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49412039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}