{"title":"When Is Classroom Assessment Educational Measurement?","authors":"Susan M. Brookhart, Sarah M. Bonner","doi":"10.1111/emip.70021","DOIUrl":"https://doi.org/10.1111/emip.70021","url":null,"abstract":"<p>The relationship between classroom assessment and educational measurement has been under discussion for some time. This article uses the TISM framework (<i>Theory, Instrumentation, Scales and units</i>, and <i>Modeling</i>) to clarify which aspects of classroom assessment are educational measurement (e.g., a grade on a performance assessment keyed to a learning standard) and which are not (e.g., extended elaborated feedback on that same assessment). We conclude that classroom assessment which produces ordinal or interval-level quantitative scores—by whatever name they are called, including scores, grades, and performance levels—is educational measurement because it implicates theory, instrumentation, scales and units, and modeling of error. On this basis, we claim that work in classroom assessment and educational measurement can and should be mutually informative.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 2","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147715036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Confidence Modeling to Optimize Overall Score Quality in Hybrid Scoring Systems","authors":"Alexander Kwako, Susan Lottridge, Christopher Ormerod","doi":"10.1111/emip.70019","DOIUrl":"https://doi.org/10.1111/emip.70019","url":null,"abstract":"<p>In large-scale assessments, constructed response items are often scored using hybrid scoring systems, which combine human and automated scores. In this study, we augment automated scoring with confidence modeling to strategically route difficult-to-score responses for human review. We utilize <i>hybrid performance curves</i> to visualize the impact of routing on performance. Additionally, we propose several <i>hybrid scoring policies</i> for selecting optimal routing thresholds given practical constraints. Our findings reveal that hybrid scoring systems can achieve an overall performance that exceeds that of human- and automated-only systems. Moreover, the superior performance of the hybrid system is less expensive than a human-only system. These findings highlight the complementarity of human raters and automated scoring engines. Although current standards focus on the performance of human raters and automated scoring engines in isolation, we recommend that practitioners also report on the performance of the hybrid scoring system as a whole.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 2","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.70019","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147714992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Large-Scale Mathematical Assessments: Leveraging Hierarchical Attribute Structures and Diagnostic Classification Models for Enhanced Student Diagnostics","authors":"Farshad Effatpanah, Olga Kunina-Habenicht, Steve Bernard, Caroline Hornung, Philipp Sonnleitner","doi":"10.1111/emip.70016","DOIUrl":"https://doi.org/10.1111/emip.70016","url":null,"abstract":"<p>Diagnostic classification models (DCMs) assess students’ mastery of cognitive attributes to provide personalized ability profiles. Retrofitting DCMs to large-scale mathematics assessments usually relies on inferred Q-matrices, which can reduce accuracy and diagnostic value. This study evaluated whether constructing items from cognitive models—yielding Q-matrices directly—and incorporating hierarchical relationships among attributes improve diagnostic outcomes. Responses from 5,336 third-grade students to a Luxembourgish image-based, large-scale standardized mathematics exam were analyzed using multiple DCMs and their hierarchical extensions. Items were constructed based on a Q-matrix, derived from the curriculum and cognitive models. The hierarchical A-CDM outperformed other models, classifying students into 60 latent classes with acceptable attribute- and test-level accuracy and more interpretable results than the G-DINA model. Using cognitive model-based item generation and Q-matrices as well as specifying attribute hierarchies enhance the accuracy and interpretability of DCM-based diagnostics in large-scale assessments, complementing traditional psychometric approaches by discerning meaningful within-score differences.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 2","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.70016","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147566730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Rater Effects of Large Language Models in Automated Essay Scoring: GPT, Claude, Gemini, and DeepSeek","authors":"Hong Jiao, Dan Song, Won-Chan Lee","doi":"10.1111/emip.70018","DOIUrl":"https://doi.org/10.1111/emip.70018","url":null,"abstract":"<p>Large language models (LLMs) have been widely explored for automated scoring in educational assessment to facilitate learning and instruction. However, empirical evidence regarding which LLMs produce the most reliable scores and induce the least rater effects remains limited. This study compared 10 LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. Their performance was evaluated in terms of score accuracy, intra-rater consistency, and rater effects estimated using the Many-Facet Rasch model. Although the results generally supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet with high scoring accuracy, better intra-rater consistency, and less rater effects, the study is not intended to support substantive comparisons or rankings of LLMs or to identify a single “best” model, given the small sample size.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 2","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147565348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Limitations of QWK in Evaluating Automated and Human Scoring Systems","authors":"Jennifer Lewis, Jodi M. Casabianca","doi":"10.1111/emip.70017","DOIUrl":"10.1111/emip.70017","url":null,"abstract":"<p>To assess the interrater reliability of human ratings of constructed responses (CR), or the accuracy of scores given by automated scoring engines, concordance metrics quantify agreement between measures. This article examines the quadratic weighted kappa (QWK) in these contexts and highlights its practical limitations compared to other metrics. Both empirical and simulation study results reveal how different factors including the shape of the marginal distributions and score scale length may impact the estimates and how we can adjust for these properties of the contingency table. The results highlight the QWK's sensitivities and suggest that additional caution should be taken before decisions about whether to keep a CR item on a test form are made. If using QWK without the proper interpretive supports, such decisions may be misinformed. Consequently, we make suggestions for best practices to promote responsible evaluation of agreement in the context of CR scoring in educational testing.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147568745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Limitations of QWK in Evaluating Automated and Human Scoring Systems","authors":"Jennifer Lewis, Jodi M. Casabianca","doi":"10.1111/emip.70017","DOIUrl":"10.1111/emip.70017","url":null,"abstract":"<p>To assess the interrater reliability of human ratings of constructed responses (CR), or the accuracy of scores given by automated scoring engines, concordance metrics quantify agreement between measures. This article examines the quadratic weighted kappa (QWK) in these contexts and highlights its practical limitations compared to other metrics. Both empirical and simulation study results reveal how different factors including the shape of the marginal distributions and score scale length may impact the estimates and how we can adjust for these properties of the contingency table. The results highlight the QWK's sensitivities and suggest that additional caution should be taken before decisions about whether to keep a CR item on a test form are made. If using QWK without the proper interpretive supports, such decisions may be misinformed. Consequently, we make suggestions for best practices to promote responsible evaluation of agreement in the context of CR scoring in educational testing.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147568711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classroom Assessment Validation: Proficiency Claims and Uses","authors":"James H. McMillan","doi":"10.1111/emip.70014","DOIUrl":"10.1111/emip.70014","url":null,"abstract":"<p>Unlike standardized testing applications of validity, teachers need a simple and efficient way to reflect on the accuracy of the claims based on student performance, then consider whether the uses of those claims are appropriate. A two-phase reasoning process of validation, consisting of a proficiency claim /argument and a use/argument, is presented as a way for teachers to understand and apply the central tenets of validation to their classroom assessments. Since classroom assessment is contextualized with multiple purposes, each teacher is obligated to use validation for their situation. The accuracy of teachers’ conclusions about the proficiency claims, and uses, will depend on their skill in gathering supportive evidence and considering alternative explanations. Examples of the proposed classroom assessment validation process are presented.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146139296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Sensitivity of Value-Added Estimates to Test Scoring Decisions","authors":"Joshua B. Gilbert, James G. Soland, Benjamin W. Domingue","doi":"10.1111/emip.70011","DOIUrl":"10.1111/emip.70011","url":null,"abstract":"<p>Value-added models (VAMs) are both common and controversial in education policy and accountability research. While the sensitivity of VAM results to model specification and covariate selection is well documented, the extent to which test scoring methods (e.g., mean scores vs. item response theory based scores) may affect Value-added (VA) estimates is less studied. We examine the sensitivity of VA estimates to the scoring method using empirical item response data from 18 education datasets. We find that VA estimates can be sensitive to the choice of scoring method, holding constant students and items. While the various test scores are highly correlated, on average, using different scoring approaches leads to variation in VA percentile ranks of over 20 points, and more than 50% of teachers or schools are classified in multiple quartiles of the VA distribution. Dispersion in VA ranks is reduced with more complete item response data. Our findings suggest that consideration of both measurement error and model uncertainty are important for the appropriate interpretation of VAMs.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146155055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI-Generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity","authors":"Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang","doi":"10.1111/emip.70013","DOIUrl":"10.1111/emip.70013","url":null,"abstract":"<p>The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146136621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}