{"title":"Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System","authors":"Wallace N. Pinto Jr, Jinnie Shin","doi":"10.1111/jedm.12438","DOIUrl":"https://doi.org/10.1111/jedm.12438","url":null,"abstract":"<p>In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"248-281"},"PeriodicalIF":1.4,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144525053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing and Combining IRTree Models and Anchoring Vignettes in Addressing Response Styles","authors":"Mingfeng Xue, Ping Chen","doi":"10.1111/jedm.12437","DOIUrl":"https://doi.org/10.1111/jedm.12437","url":null,"abstract":"<p>Response styles pose great threats to psychological measurements. This research compares IRTree models and anchoring vignettes in addressing response styles and estimating the target traits. It also explores the potential of combining them at the item level and total-score level (ratios of extreme and middle responses to vignettes). Four models were evaluated: three multidimensional IRTree models with different levels of using vignette data and a nominal response model (NRM) addressing extreme and midpoint response styles with item-level vignette responses. Simulation results indicated that the IRTree model using item-level vignette responses outperformed others in estimating the target trait and response styles to different extents, with performance improving as the number of vignettes increased. Empirical findings further demonstrated that models using item-level vignette information yielded higher reliability and closely aligned target trait estimates. These results underscore the value of integrating anchoring vignettes with IRTree models to enhance estimation accuracy and control for response styles.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"225-247"},"PeriodicalIF":1.4,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validation for Personalized Assessments: A Threats-to-Validity Approach","authors":"Sandip Sinharay, Randy E. Bennett, Michael Kane, Jesse R. Sparks","doi":"10.1111/jedm.12434","DOIUrl":"https://doi.org/10.1111/jedm.12434","url":null,"abstract":"<p>Personalized assessments are of increasing interest because of their potential to lead to more equitable decisions about the examinees. However, one obstacle to the widespread use of personalized assessments is the lack of a measurement toolkit that can be used to analyze data from these assessments. This article takes one step toward building such a toolkit by proposing a validation framework for personalized assessments. The framework is built on the threats-to-validity approach. We demonstrate applications of the suggested framework using the AP 3D Art and Design Portfolio examination and a more restrictive culturally relevant assessment as examples.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"282-310"},"PeriodicalIF":1.4,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Multiple Maximum Exposure Rates in Computerized Adaptive Testing","authors":"Kylie Gorney, Mark D. Reckase","doi":"10.1111/jedm.12436","DOIUrl":"https://doi.org/10.1111/jedm.12436","url":null,"abstract":"<p>In computerized adaptive testing, item exposure control methods are often used to provide a more balanced usage of the item pool. Many of the most popular methods, including the restricted method (Revuelta and Ponsoda), use a single maximum exposure rate to limit the proportion of times that each item is administered. However, Barrada et al. showed that by using multiple maximum exposure rates, it is possible to obtain an even more balanced usage of the item pool. Therefore, in this paper, we develop four extensions of the restricted method that involve the use of multiple maximum exposure rates. A detailed simulation study reveals that (a) all four of the new methods improve item pool utilization and (b) three of the new methods also improve measurement accuracy. Taken together, these results are highly encouraging, as they reveal that it is possible to improve both types of outcomes simultaneously.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"360-379"},"PeriodicalIF":1.4,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12436","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theory-Driven IRT Modeling of Vocabulary Development: Matthew Effects and the Case for Unipolar IRT","authors":"Qi (Helen) Huang, Daniel M. Bolt, Xiangyi Liao","doi":"10.1111/jedm.12433","DOIUrl":"https://doi.org/10.1111/jedm.12433","url":null,"abstract":"<p>Item response theory (IRT) encompasses a broader class of measurement models than is commonly appreciated by practitioners in educational measurement. For measures of vocabulary and its development, we show how psychological theory might in certain instances support unipolar IRT modeling as a superior alternative to the more traditional bipolar IRT models fit in practice. Although corresponding model choices make unipolar IRT statistically equivalent with bipolar IRT, adopting the unipolar approach substantially alters the resulting metric for proficiency. This shift can have substantial implications for educational research and practices that depend heavily on interval-level score interpretations. As an example, we illustrate through simulation how the perspective of unipolar IRT may account for inconsistencies seen across empirical studies in the observation (or lack thereof) of Matthew effects in reading/vocabulary development (i.e., growth being positively correlated with baseline proficiency), despite theoretical expectations for their presence. Additionally, a unipolar measurement perspective can reflect the anticipated diversification of vocabulary as proficiency level increases. Implications of unipolar IRT representations for constructing tests of vocabulary proficiency and evaluating measurement error are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"199-224"},"PeriodicalIF":1.4,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12433","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Another Look at Yen's Q3: Is .2 an Appropriate Cut-Off?","authors":"Kelsey Nason, Christine DeMars","doi":"10.1111/jedm.12432","DOIUrl":"https://doi.org/10.1111/jedm.12432","url":null,"abstract":"<p>This study examined the widely used threshold of .2 for Yen's Q3, an index for violations of local independence. Specifically, a simulation was conducted to investigate whether Q3 values were related to the magnitude of bias in estimates of reliability, item parameters, and examinee ability. Results showed that Q3 values below the typical cut-off yielded meaningful bias in estimates. Practical implications and limitations are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"345-359"},"PeriodicalIF":1.4,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12432","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Anchor Selection Strategies for DIF Analysis","authors":"Haeju Lee, Kyung Yong Kim","doi":"10.1111/jedm.12429","DOIUrl":"https://doi.org/10.1111/jedm.12429","url":null,"abstract":"<p>When no prior information of differential item functioning (DIF) exists for items in a test, either the rank-based or iterative purification procedure might be preferred. The rank-based purification selects anchor items based on a preliminary DIF test. For a preliminary DIF test, likelihood ratio test (LRT) based approaches (e.g., all-others-as-anchors: AOAA and one-item-anchor: OIA) and an improved version of Lord's Wald test (i.e., anchor-all-test-all: AATA) have been used in research studies. However, both LRT- and Wald-based procedures often select DIF items as anchor items and as a result, inflate Type <span></span><math>\u0000 <semantics>\u0000 <mi>I</mi>\u0000 <annotation>${mathrm{{mathrm I}}}$</annotation>\u0000 </semantics></math> error rates. To overcome this issue, minimum test statistics (<span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>Min</mi>\u0000 <mspace></mspace>\u0000 <msup>\u0000 <mi>G</mi>\u0000 <mn>2</mn>\u0000 </msup>\u0000 </mrow>\u0000 <annotation>${mathrm{Min}};{G^2}$</annotation>\u0000 </semantics></math>/<span></span><math>\u0000 <semantics>\u0000 <msup>\u0000 <mi>χ</mi>\u0000 <mn>2</mn>\u0000 </msup>\u0000 <annotation>${chi ^2}$</annotation>\u0000 </semantics></math>) or items with nonsignificant test statistics and large discrimination parameter estimates (<span></span><math>\u0000 <semantics>\u0000 <mi>NonsigMax</mi>\u0000 <annotation>${mathrm{NonsigMax}}$</annotation>\u0000 </semantics></math><i>A</i>) have been suggested in the literature to select anchor items. Nevertheless, little research has been done comparing combinations of the three anchor selection procedures paired with the two anchor selection criteria. Thus, the performance of the six rank-based strategies was compared in this study in terms of accuracy, power, and Type <span></span><math>\u0000 <semantics>\u0000 <mi>I</mi>\u0000 <annotation>${mathrm{{mathrm I}}}$</annotation>\u0000 </semantics></math> error rates. Among the rank-based strategies, the AOAA-based strategies demonstrated greater robustness across various conditions compared to the AATA- and OIA-based strategies. Additionally, the <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mrow>\u0000 <mi>Min</mi>\u0000 <mspace></mspace>\u0000 </mrow>\u0000 <msup>\u0000 <mi>G</mi>\u0000 <mn>2</mn>\u0000 </msup>\u0000 </mrow>\u0000 <annotation>${mathrm{Min;}}{G^2}$</annotation>\u0000 </semantics></math>/<span></span><math>\u0000 <se","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"311-344"},"PeriodicalIF":1.4,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12429","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144525159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Vulnerability of AI-Based Scoring Systems to Gaming Strategies: A Case Study","authors":"Peter Baldwin, Victoria Yaneva, Kai North, Le An Ha, Yiyun Zhou, Alex J. Mechaber, Brian E. Clauser","doi":"10.1111/jedm.12427","DOIUrl":"https://doi.org/10.1111/jedm.12427","url":null,"abstract":"<p>Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA automated scoring system. We explore a range of strategies including responding with words selected from the item stem and responding with multiple answers. These responses would be easily identified as incorrect by a human rater but may result in false-positive classifications from an automated system. Our results show that the rate at which these strategies produce responses that are scored as correct varied across items and across strategies but that several vulnerabilities exist.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"172-194"},"PeriodicalIF":1.4,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment","authors":"Shun-Fu Hu, Amery D. Wu, Jake Stone","doi":"10.1111/jedm.12424","DOIUrl":"https://doi.org/10.1111/jedm.12424","url":null,"abstract":"<p>Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated with an example of scoring the short version of the College Majors Preference assessment (Short CMPA) to match the results of whether the 50 college majors would be in one's top three, as determined by the Long CMPA. The results reveal that MNN significantly outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"120-144"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143689091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model","authors":"Tong Wu, Stella Y. Kim, Carl Westine, Michelle Boyer","doi":"10.1111/jedm.12425","DOIUrl":"https://doi.org/10.1111/jedm.12425","url":null,"abstract":"<p>While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"145-171"},"PeriodicalIF":1.4,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}