{"title":"Detecting Local Dependence: A Threshold-Autoregressive Item Response Theory (TAR-IRT) Approach for Polytomous Items","authors":"Xiaodan Tang, G. Karabatsos, Haiqin Chen","doi":"10.1080/08957347.2020.1789136","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789136","url":null,"abstract":"ABSTRACT In applications of item response theory (IRT) models, it is known that empirical violations of the local independence (LI) assumption can significantly bias parameter estimates. To address this issue, we propose a threshold-autoregressive item response theory (TAR-IRT) model that additionally accounts for order dependence among the item responses of each examinee. The TAR-IRT approach also defines a new family of IRT models for polytomous item responses under both unidimensional and multidimensional frameworks, with order-dependent effects between item responses and relevant dimensions. The feasibility of the proposed model was demonstrated by an empirical study using a polytomous response data. A simulation study for polytomous item responses with order effects of different magnitude in an education context shows that the TAR modeling framework could provide more accurate ability estimation than the partial credit model when order effect exists.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"280 - 292"},"PeriodicalIF":1.5,"publicationDate":"2020-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789136","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42274266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validating Rubric Scoring Processes: An Application of an Item Response Tree Model","authors":"Aaron J. Myers, Allison J. Ames, B. Leventhal, Madison A. Holzman","doi":"10.1080/08957347.2020.1789143","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789143","url":null,"abstract":"ABSTRACT When rating performance assessments, raters may ascribe different scores for the same performance when rubric application does not align with the intended application of the scoring criteria. Given performance assessment score interpretation assumes raters apply rubrics as rubric developers intended, misalignment between raters’ scoring processes and the intended scoring processes may lead to invalid inferences from these scores. In an effort to standardize raters’ scoring processes, an alternative scoring method was used. With this method, rubric developers’ intended scoring processes are made explicit by requiring raters to respond to a series of selected-response statements resembling a decision tree. To determine if raters scored essays as intended using a traditional rubric and the alternative scoring method, an IRT model with a tree-like structure (IRTree) was specified to depict the intended scoring processes and fit to data from each scoring method. Results suggest raters using the alternative method may be better able to rate as intended and thus the alternative method may be a viable alternative to traditional rubric scoring. Implications of the IRTree model are discussed.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"293 - 308"},"PeriodicalIF":1.5,"publicationDate":"2020-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789143","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42133773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An IRT Mixture Model for Rating Scale Confusion Associated with Negatively Worded Items in Measures of Social-Emotional Learning","authors":"D. Bolt, Y. Wang, R. Meyer, L. Pier","doi":"10.1080/08957347.2020.1789140","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789140","url":null,"abstract":"ABSTRACT We illustrate the application of mixture IRT models to evaluate respondent confusion due to the negative wording of certain items on a social-emotional learning (SEL) assessment. Using actual student self-report ratings on four social-emotional learning scales collected from students in grades 3–12 from CORE Districts in the state of California, we also evaluate the consequences of the potential confusion in biasing student- and school-level scores as well as the estimated correlational relationships between SEL constructs and student-level variables. Models of both full and partial confusion are examined. Our results suggest that (1) rating scale confusion due to negatively worded items does appear to be present; (2) the confusion is most prevalent at lower grade levels (third–fifth); and (3) the occurrence of confusion is positively related to both reading proficiency and ELL status, as anticipated, and consequently biases estimates of SEL correlations with these student-level variables. For these reasons, we suggest future iterations of the SEL measures use only positively oriented items.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"331 - 348"},"PeriodicalIF":1.5,"publicationDate":"2020-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789140","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43253014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Random and Systematic Error in Student Growth Percentiles","authors":"C. Wells, S. Sireci","doi":"10.1080/08957347.2020.1789139","DOIUrl":"https://doi.org/10.1080/08957347.2020.1789139","url":null,"abstract":"ABSTRACT Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in SGPs by simulating test scores for four grades and estimating SGPs using one, two, or three conditioning years. The results indicated that, although the amount of systematic error was small to moderate, the amount of random error was substantial, regardless of the number of conditioning years. For example, the standard error of the SGP estimates associated with an SGP value of 56 was 22.2 resulting in a 68% confidence interval that would range from 33.8 to 78.2 when using three conditioning years. The results are consistent with previous research and suggest SGP estimates are too imprecise to be reported for the purpose of understanding students’ progress over time.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"349 - 361"},"PeriodicalIF":1.5,"publicationDate":"2020-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1789139","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43006041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Setting Scoring Expectations on Rater Scoring Rates and Accuracy","authors":"Cathy L. W. Wendler, Nancy Glazer, B. Bridgeman","doi":"10.1080/08957347.2020.1750401","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750401","url":null,"abstract":"ABSTRACT Efficient constructed response (CR) scoring requires both accuracy and speed from human raters. This study was designed to determine if setting scoring rate expectations would encourage raters to score at a faster pace, and if so, if there would be differential effects on scoring accuracy for raters who score at different rates. Three rater groups (slow, medium, and fast) and two conditions (informed and uninformed) were used. In both conditions, raters were given identical scoring directions, but only the informed groups were given an expected scoring rate. Results indicated no significant differences across the two conditions. However, there were significant increases in scoring rates for medium and slow raters compared to their previous operational rates, regardless of whether they were in the informed or uninformed condition. Results also showed there were no significant effects on rater accuracy for either of the two conditions or for any of the rater groups.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"248 - 254"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750401","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42360842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and Interpreting Human Scoring","authors":"Nancy Glazer, E. Wolfe","doi":"10.1080/08957347.2020.1750402","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750402","url":null,"abstract":"ABSTRACT This introductory article describes how constructed response scoring is carried out, particularly the rater monitoring processes and illustrates three potential designs for conducting rater monitoring in an operational scoring project. The introduction also presents a framework for interpreting research conducted by those who study the constructed response scoring process. That framework identifies three classifications of inputs (rater characteristics, response content, and rating context) which typically serve as independent variables in constructed response scoring research as well as three primary outcomes (rating quality, rating speed, and rater attitude) which serve as the dependent variables in those studies. Finally, we explain how each of the articles in this issue can be classified according to that framework.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"191 - 197"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750402","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42557747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Should We Care about Human Raters?","authors":"E. Wolfe, Cathy L. W. Wendler","doi":"10.1080/08957347.2020.1750407","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750407","url":null,"abstract":"For more than a decade, measurement practitioners and researchers have emphasized evaluating, improving, and implementing automated scoring of constructed response (CR) items and tasks. There is go...","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"189 - 190"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750407","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46471978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Commentary on “Using Human Raters in Constructed Response Scoring: Understanding, Predicting, and Modifying Performance”","authors":"Walter D. Way","doi":"10.1080/08957347.2020.1750408","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750408","url":null,"abstract":"This special issue of AME provides a rich set of articles related to monitoring human scoring of constructed response items. As a starting point for this commentary, is it worth mentioning that the...","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"255 - 261"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750408","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41452462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Human Scoring Using Generalizability Theory","authors":"Y. Bimpeh, W. Pointer, Ben A. Smith, Liz Harrison","doi":"10.1080/08957347.2020.1750403","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750403","url":null,"abstract":"ABSTRACT Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we apply generalizability theory (G theory) to data from routine monitoring of ratings to derive an estimate for inter-rater reliability. UK examinations use a combination of double or multiple rating for routine monitoring, creating a more complex design that consists of cross-pairing of raters and overlapping of raters for different groups of candidates or items. This sampling design is neither fully crossed nor is it nested. Each double- or multiple-scored item takes a different set of candidates, and the number of sampled candidates per item varies. Therefore, the standard G theory method, and its various forms for estimating inter-rater reliability, cannot be directly applied to the operational data. We propose a method that takes double or multiple rating data as given and analyzes the datasets at the item level in order to obtain more accurate and stable variance component estimates. We adapt the variance component in observed scores for an unbalanced one-facet crossed design with some missing observations. These estimates can be used to make inferences about the reliability of the entire scoring process. We illustrate the proposed method by applying it to real scoring data.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"198 - 209"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750403","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43345855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Operational Scoring Experience and Additional Mentored Training on Raters’ Essay Scoring Accuracy","authors":"Ikkyu Choi, E. Wolfe","doi":"10.1080/08957347.2020.1750404","DOIUrl":"https://doi.org/10.1080/08957347.2020.1750404","url":null,"abstract":"ABSTRACT Rater training is essential in ensuring the quality of constructed response scoring. Most of the current knowledge about rater training comes from experimental contexts with an emphasis on short-term effects. Few sources are available for empirical evidence on whether and how raters become more accurate as they gain scoring experiences or what long-term effects training can have. In this study, we addressed this research gap by tracking how the accuracies of new raters change through experience and by examining the impact of an additional training session on their accuracies in scoring calibration and monitoring essays. We found that, on average, raters’ accuracy improved with scoring experience and that individual raters differed in their accuracy trajectories. The estimated average effect of the training was an approximately six percent increase in the calibration essay accuracy. On the other hand, we observed a smaller impact on the monitoring essay accuracy. Our follow-up analysis showed that this differential impact of the additional training on the calibration and monitoring essay accuracy could be accounted for by successful gatekeeping through calibration.","PeriodicalId":51609,"journal":{"name":"Applied Measurement in Education","volume":"33 1","pages":"210 - 222"},"PeriodicalIF":1.5,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/08957347.2020.1750404","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45226677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}