Educational Measurement: Issues and Practice (Latest Articles)

When Is Classroom Assessment Educational Measurement?
IF 1.9 | CAS Q4 (Education)
Educational Measurement: Issues and Practice, 45(2). Pub Date: 2026-04-06. DOI: 10.1111/emip.70021
Susan M. Brookhart, Sarah M. Bonner

Abstract: The relationship between classroom assessment and educational measurement has been under discussion for some time. This article uses the TISM framework (Theory, Instrumentation, Scales and units, and Modeling) to clarify which aspects of classroom assessment are educational measurement (e.g., a grade on a performance assessment keyed to a learning standard) and which are not (e.g., extended elaborated feedback on that same assessment). We conclude that classroom assessment which produces ordinal or interval-level quantitative scores—by whatever name they are called, including scores, grades, and performance levels—is educational measurement because it implicates theory, instrumentation, scales and units, and modeling of error. On this basis, we claim that work in classroom assessment and educational measurement can and should be mutually informative.

Citations: 0
Using Confidence Modeling to Optimize Overall Score Quality in Hybrid Scoring Systems
Educational Measurement: Issues and Practice, 45(2). Pub Date: 2026-04-05. DOI: 10.1111/emip.70019
Alexander Kwako, Susan Lottridge, Christopher Ormerod

Abstract: In large-scale assessments, constructed response items are often scored using hybrid scoring systems, which combine human and automated scores. In this study, we augment automated scoring with confidence modeling to strategically route difficult-to-score responses for human review. We utilize hybrid performance curves to visualize the impact of routing on performance. Additionally, we propose several hybrid scoring policies for selecting optimal routing thresholds given practical constraints. Our findings reveal that hybrid scoring systems can achieve an overall performance that exceeds that of human- and automated-only systems. Moreover, the superior performance of the hybrid system is less expensive than a human-only system. These findings highlight the complementarity of human raters and automated scoring engines. Although current standards focus on the performance of human raters and automated scoring engines in isolation, we recommend that practitioners also report on the performance of the hybrid scoring system as a whole.

Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.70019
Citations: 0
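The confidence-based routing described in the abstract can be sketched as a simple threshold rule: keep the automated score when the engine is confident, otherwise route the response for human review. This is an illustrative sketch, not the authors' implementation; the function names and the threshold value are assumptions.

```python
def hybrid_scores(auto_scores, confidences, human_scorer, threshold):
    """Keep confident automated scores; route the rest to a human rater.

    auto_scores:  engine score per response
    confidences:  engine confidence per response, in [0, 1]
    human_scorer: callable taking a response index, returning a human score
    threshold:    responses with confidence below this are routed
    """
    final, routed = [], 0
    for idx, (score, conf) in enumerate(zip(auto_scores, confidences)):
        if conf < threshold:
            final.append(human_scorer(idx))  # low confidence: human review
            routed += 1
        else:
            final.append(score)              # high confidence: keep engine score
    return final, routed / len(auto_scores)

# Sweeping the threshold trades routing cost against score quality, which is
# what the paper's "hybrid performance curves" visualize.
scores, routing_rate = hybrid_scores(
    auto_scores=[2, 3, 1],
    confidences=[0.95, 0.40, 0.90],
    human_scorer=lambda i: 4,   # stand-in for a human rater
    threshold=0.5,
)
```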
Optimizing Large-Scale Mathematical Assessments: Leveraging Hierarchical Attribute Structures and Diagnostic Classification Models for Enhanced Student Diagnostics
Educational Measurement: Issues and Practice, 45(2). Pub Date: 2026-03-17. DOI: 10.1111/emip.70016
Farshad Effatpanah, Olga Kunina-Habenicht, Steve Bernard, Caroline Hornung, Philipp Sonnleitner

Abstract: Diagnostic classification models (DCMs) assess students' mastery of cognitive attributes to provide personalized ability profiles. Retrofitting DCMs to large-scale mathematics assessments usually relies on inferred Q-matrices, which can reduce accuracy and diagnostic value. This study evaluated whether constructing items from cognitive models—yielding Q-matrices directly—and incorporating hierarchical relationships among attributes improve diagnostic outcomes. Responses from 5,336 third-grade students to a Luxembourgish image-based, large-scale standardized mathematics exam were analyzed using multiple DCMs and their hierarchical extensions. Items were constructed based on a Q-matrix derived from the curriculum and cognitive models. The hierarchical A-CDM outperformed other models, classifying students into 60 latent classes with acceptable attribute- and test-level accuracy and more interpretable results than the G-DINA model. Using cognitive model-based item generation and Q-matrices, as well as specifying attribute hierarchies, enhances the accuracy and interpretability of DCM-based diagnostics in large-scale assessments, complementing traditional psychometric approaches by discerning meaningful within-score differences.

Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.70016
Citations: 0
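One concrete payoff of specifying an attribute hierarchy, as in the abstract above, is that it shrinks the latent class space: only mastery profiles consistent with the prerequisite relations remain permissible. A minimal sketch under assumed attributes and a hypothetical linear hierarchy (not the paper's actual attribute structure):

```python
from itertools import product

def permissible_profiles(n_attrs, prereqs):
    """Enumerate binary mastery profiles consistent with a hierarchy.

    prereqs: (parent, child) pairs -- mastering child requires the parent.
    """
    return [p for p in product([0, 1], repeat=n_attrs)
            if all(p[parent] >= p[child] for parent, child in prereqs)]

# A linear hierarchy A0 -> A1 -> A2 cuts the 2**3 = 8 unconstrained
# profiles down to 4: (0,0,0), (1,0,0), (1,1,0), (1,1,1).
profiles = permissible_profiles(3, prereqs=[(0, 1), (1, 2)])
```

The same pruning is what lets a hierarchical DCM work with fewer, more interpretable latent classes than an unconstrained model over the same attributes.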
Evaluating Rater Effects of Large Language Models in Automated Essay Scoring: GPT, Claude, Gemini, and DeepSeek
Educational Measurement: Issues and Practice, 45(2). Pub Date: 2026-03-12. DOI: 10.1111/emip.70018
Hong Jiao, Dan Song, Won-Chan Lee

Abstract: Large language models (LLMs) have been widely explored for automated scoring in educational assessment to facilitate learning and instruction. However, empirical evidence regarding which LLMs produce the most reliable scores and induce the least rater effects remains limited. This study compared 10 LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. Their performance was evaluated in terms of score accuracy, intra-rater consistency, and rater effects estimated using the Many-Facet Rasch model. Although the results generally supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet with high scoring accuracy, better intra-rater consistency, and less rater effects, the study is not intended to support substantive comparisons or rankings of LLMs or to identify a single "best" model, given the small sample size.

Citations: 0
Limitations of QWK in Evaluating Automated and Human Scoring Systems
Educational Measurement: Issues and Practice, 45(1). Pub Date: 2026-02-25. DOI: 10.1111/emip.70017
Jennifer Lewis, Jodi M. Casabianca

Abstract: To assess the interrater reliability of human ratings of constructed responses (CR), or the accuracy of scores given by automated scoring engines, concordance metrics quantify agreement between measures. This article examines the quadratic weighted kappa (QWK) in these contexts and highlights its practical limitations compared to other metrics. Both empirical and simulation study results reveal how different factors, including the shape of the marginal distributions and score scale length, may impact the estimates, and how we can adjust for these properties of the contingency table. The results highlight the QWK's sensitivities and suggest that additional caution should be taken before decisions about whether to keep a CR item on a test form are made. If using QWK without the proper interpretive supports, such decisions may be misinformed. Consequently, we make suggestions for best practices to promote responsible evaluation of agreement in the context of CR scoring in educational testing.

Citations: 0
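For readers unfamiliar with the metric: QWK penalizes disagreements by the squared distance between score categories and corrects for chance agreement using the raters' marginal distributions, which is exactly why the marginals' shape is a sensitivity the abstract flags. A self-contained reference implementation (not the authors' code):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_categories):
    """QWK between two raters' scores, coded 0..n_categories-1."""
    O = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        O[i, j] += 1
    O /= O.sum()                                 # observed joint distribution
    E = np.outer(O.sum(axis=1), O.sum(axis=0))   # chance expectation from marginals
    idx = np.arange(n_categories)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1.0 - (w * O).sum() / (w * E).sum()

# Perfect agreement yields 1; agreement no better than chance yields ~0.
qwk_perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], n_categories=4)
qwk_chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], n_categories=2)
```

Because the denominator is built from the marginals, the same raw disagreement pattern can produce quite different QWK values under skewed versus uniform score distributions, which is the sensitivity the article examines.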
Classroom Assessment Validation: Proficiency Claims and Uses
Educational Measurement: Issues and Practice, 45(1). Pub Date: 2026-02-04. DOI: 10.1111/emip.70014
James H. McMillan

Abstract: Unlike standardized testing applications of validity, teachers need a simple and efficient way to reflect on the accuracy of the claims based on student performance, then consider whether the uses of those claims are appropriate. A two-phase reasoning process of validation, consisting of a proficiency claim/argument and a use/argument, is presented as a way for teachers to understand and apply the central tenets of validation to their classroom assessments. Since classroom assessment is contextualized with multiple purposes, each teacher is obligated to use validation for their situation. The accuracy of teachers' conclusions about the proficiency claims and uses will depend on their skill in gathering supportive evidence and considering alternative explanations. Examples of the proposed classroom assessment validation process are presented.

Citations: 0
The Sensitivity of Value-Added Estimates to Test Scoring Decisions
Educational Measurement: Issues and Practice, 45(1). Pub Date: 2026-02-03. DOI: 10.1111/emip.70011
Joshua B. Gilbert, James G. Soland, Benjamin W. Domingue

Abstract: Value-added models (VAMs) are both common and controversial in education policy and accountability research. While the sensitivity of VAM results to model specification and covariate selection is well documented, the extent to which test scoring methods (e.g., mean scores vs. item response theory based scores) may affect value-added (VA) estimates is less studied. We examine the sensitivity of VA estimates to the scoring method using empirical item response data from 18 education datasets. We find that VA estimates can be sensitive to the choice of scoring method, holding constant students and items. While the various test scores are highly correlated, on average, using different scoring approaches leads to variation in VA percentile ranks of over 20 points, and more than 50% of teachers or schools are classified in multiple quartiles of the VA distribution. Dispersion in VA ranks is reduced with more complete item response data. Our findings suggest that consideration of both measurement error and model uncertainty are important for the appropriate interpretation of VAMs.

Citations: 0
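The core comparison in the abstract, rank orderings under different scoring rules applied to the same responses, can be illustrated with a toy version: sum scores versus discrimination-weighted scores on one simulated response matrix. The weights are random stand-ins, not estimated IRT parameters, and the data are synthetic; this only illustrates the mechanism, not the paper's empirical results.

```python
import numpy as np

rng = np.random.default_rng(7)
responses = rng.integers(0, 2, size=(100, 20))   # 100 students x 20 binary items
weights = rng.uniform(0.5, 2.0, size=20)         # stand-in item discriminations

sum_scores = responses.sum(axis=1)               # unweighted sum-score method
weighted_scores = responses @ weights            # crude IRT-like weighted method

def percentile_ranks(x):
    """Rank each value 0..100 by position in the sorted order."""
    order = np.argsort(np.argsort(x))
    return 100.0 * order / (len(x) - 1)

# The two score vectors correlate highly, yet individual students'
# percentile ranks still shift between scoring methods.
shift = np.abs(percentile_ranks(sum_scores) - percentile_ranks(weighted_scores))
r = np.corrcoef(sum_scores, weighted_scores)[0, 1]
```

Aggregating such rank shifts over teachers or schools is, loosely, how scoring decisions propagate into the VA quartile reclassifications the paper reports.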
AI-Generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity
Educational Measurement: Issues and Practice, 45(1). Pub Date: 2026-01-29. DOI: 10.1111/emip.70013
Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang

Abstract: The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.

Citations: 0
Issue Cover
Educational Measurement: Issues and Practice, 44(3). Pub Date: 2025-09-08. DOI: 10.1111/emip.70005
Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.70005
Citations: 0