{"title":"Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System","authors":"Wallace N. Pinto Jr, Jinnie Shin","doi":"10.1111/jedm.12438","DOIUrl":"https://doi.org/10.1111/jedm.12438","url":null,"abstract":"<p>In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"248-281"},"PeriodicalIF":1.4,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144525053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing and Combining IRTree Models and Anchoring Vignettes in Addressing Response Styles","authors":"Mingfeng Xue, Ping Chen","doi":"10.1111/jedm.12437","DOIUrl":"https://doi.org/10.1111/jedm.12437","url":null,"abstract":"<p>Response styles pose great threats to psychological measurements. This research compares IRTree models and anchoring vignettes in addressing response styles and estimating the target traits. It also explores the potential of combining them at the item level and total-score level (ratios of extreme and middle responses to vignettes). Four models were evaluated: three multidimensional IRTree models with different levels of using vignette data and a nominal response model (NRM) addressing extreme and midpoint response styles with item-level vignette responses. Simulation results indicated that the IRTree model using item-level vignette responses outperformed others in estimating the target trait and response styles to different extents, with performance improving as the number of vignettes increased. Empirical findings further demonstrated that models using item-level vignette information yielded higher reliability and closely aligned target trait estimates. These results underscore the value of integrating anchoring vignettes with IRTree models to enhance estimation accuracy and control for response styles.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"225-247"},"PeriodicalIF":1.4,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validation for Personalized Assessments: A Threats-to-Validity Approach","authors":"Sandip Sinharay, Randy E. Bennett, Michael Kane, Jesse R. Sparks","doi":"10.1111/jedm.12434","DOIUrl":"https://doi.org/10.1111/jedm.12434","url":null,"abstract":"<p>Personalized assessments are of increasing interest because of their potential to lead to more equitable decisions about the examinees. However, one obstacle to the widespread use of personalized assessments is the lack of a measurement toolkit that can be used to analyze data from these assessments. This article takes one step toward building such a toolkit by proposing a validation framework for personalized assessments. The framework is built on the threats-to-validity approach. We demonstrate applications of the suggested framework using the AP 3D Art and Design Portfolio examination and a more restrictive culturally relevant assessment as examples.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"282-310"},"PeriodicalIF":1.4,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Multiple Maximum Exposure Rates in Computerized Adaptive Testing","authors":"Kylie Gorney, Mark D. Reckase","doi":"10.1111/jedm.12436","DOIUrl":"https://doi.org/10.1111/jedm.12436","url":null,"abstract":"<p>In computerized adaptive testing, item exposure control methods are often used to provide a more balanced usage of the item pool. Many of the most popular methods, including the restricted method (Revuelta and Ponsoda), use a single maximum exposure rate to limit the proportion of times that each item is administered. However, Barrada et al. showed that by using multiple maximum exposure rates, it is possible to obtain an even more balanced usage of the item pool. Therefore, in this paper, we develop four extensions of the restricted method that involve the use of multiple maximum exposure rates. A detailed simulation study reveals that (a) all four of the new methods improve item pool utilization and (b) three of the new methods also improve measurement accuracy. Taken together, these results are highly encouraging, as they reveal that it is possible to improve both types of outcomes simultaneously.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"360-379"},"PeriodicalIF":1.4,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12436","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theory-Driven IRT Modeling of Vocabulary Development: Matthew Effects and the Case for Unipolar IRT","authors":"Qi (Helen) Huang, Daniel M. Bolt, Xiangyi Liao","doi":"10.1111/jedm.12433","DOIUrl":"https://doi.org/10.1111/jedm.12433","url":null,"abstract":"<p>Item response theory (IRT) encompasses a broader class of measurement models than is commonly appreciated by practitioners in educational measurement. For measures of vocabulary and its development, we show how psychological theory might in certain instances support unipolar IRT modeling as a superior alternative to the more traditional bipolar IRT models fit in practice. Although corresponding model choices make unipolar IRT statistically equivalent with bipolar IRT, adopting the unipolar approach substantially alters the resulting metric for proficiency. This shift can have substantial implications for educational research and practices that depend heavily on interval-level score interpretations. As an example, we illustrate through simulation how the perspective of unipolar IRT may account for inconsistencies seen across empirical studies in the observation (or lack thereof) of Matthew effects in reading/vocabulary development (i.e., growth being positively correlated with baseline proficiency), despite theoretical expectations for their presence. Additionally, a unipolar measurement perspective can reflect the anticipated diversification of vocabulary as proficiency level increases. Implications of unipolar IRT representations for constructing tests of vocabulary proficiency and evaluating measurement error are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"199-224"},"PeriodicalIF":1.4,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12433","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Another Look at Yen's Q3: Is .2 an Appropriate Cut-Off?","authors":"Kelsey Nason, Christine DeMars","doi":"10.1111/jedm.12432","DOIUrl":"https://doi.org/10.1111/jedm.12432","url":null,"abstract":"<p>This study examined the widely used threshold of .2 for Yen's Q3, an index for violations of local independence. Specifically, a simulation was conducted to investigate whether Q3 values were related to the magnitude of bias in estimates of reliability, item parameters, and examinee ability. Results showed that Q3 values below the typical cut-off yielded meaningful bias in estimates. Practical implications and limitations are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"345-359"},"PeriodicalIF":1.4,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12432","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Anchor Selection Strategies for DIF Analysis","authors":"Haeju Lee, Kyung Yong Kim","doi":"10.1111/jedm.12429","DOIUrl":"https://doi.org/10.1111/jedm.12429","url":null,"abstract":"<p>When no prior information of differential item functioning (DIF) exists for items in a test, either the rank-based or iterative purification procedure might be preferred. The rank-based purification selects anchor items based on a preliminary DIF test. For a preliminary DIF test, likelihood ratio test (LRT) based approaches (e.g., all-others-as-anchors: AOAA and one-item-anchor: OIA) and an improved version of Lord's Wald test (i.e., anchor-all-test-all: AATA) have been used in research studies. However, both LRT- and Wald-based procedures often select DIF items as anchor items and as a result, inflate Type <span></span><math>\u0000 <semantics>\u0000 <mi>I</mi>\u0000 <annotation>${mathrm{{mathrm I}}}$</annotation>\u0000 </semantics></math> error rates. To overcome this issue, minimum test statistics (<span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>Min</mi>\u0000 <mspace></mspace>\u0000 <msup>\u0000 <mi>G</mi>\u0000 <mn>2</mn>\u0000 </msup>\u0000 </mrow>\u0000 <annotation>${mathrm{Min}};{G^2}$</annotation>\u0000 </semantics></math>/<span></span><math>\u0000 <semantics>\u0000 <msup>\u0000 <mi>χ</mi>\u0000 <mn>2</mn>\u0000 </msup>\u0000 <annotation>${chi ^2}$</annotation>\u0000 </semantics></math>) or items with nonsignificant test statistics and large discrimination parameter estimates (<span></span><math>\u0000 <semantics>\u0000 <mi>NonsigMax</mi>\u0000 <annotation>${mathrm{NonsigMax}}$</annotation>\u0000 </semantics></math><i>A</i>) have been suggested in the literature to select anchor items. Nevertheless, little research has been done comparing combinations of the three anchor selection procedures paired with the two anchor selection criteria. Thus, the performance of the six rank-based strategies was compared in this study in terms of accuracy, power, and Type <span></span><math>\u0000 <semantics>\u0000 <mi>I</mi>\u0000 <annotation>${mathrm{{mathrm I}}}$</annotation>\u0000 </semantics></math> error rates. Among the rank-based strategies, the AOAA-based strategies demonstrated greater robustness across various conditions compared to the AATA- and OIA-based strategies. Additionally, the <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mrow>\u0000 <mi>Min</mi>\u0000 <mspace></mspace>\u0000 </mrow>\u0000 <msup>\u0000 <mi>G</mi>\u0000 <mn>2</mn>\u0000 </msup>\u0000 </mrow>\u0000 <annotation>${mathrm{Min;}}{G^2}$</annotation>\u0000 </semantics></math>/<span></span><math>\u0000 <se","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 2","pages":"311-344"},"PeriodicalIF":1.4,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12429","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144525159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Vulnerability of AI-Based Scoring Systems to Gaming Strategies: A Case Study","authors":"Peter Baldwin, Victoria Yaneva, Kai North, Le An Ha, Yiyun Zhou, Alex J. Mechaber, Brian E. Clauser","doi":"10.1111/jedm.12427","DOIUrl":"https://doi.org/10.1111/jedm.12427","url":null,"abstract":"<p>Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA automated scoring system. We explore a range of strategies including responding with words selected from the item stem and responding with multiple answers. These responses would be easily identified as incorrect by a human rater but may result in false-positive classifications from an automated system. Our results show that the rate at which these strategies produce responses that are scored as correct varied across items and across strategies but that several vulnerabilities exist.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"172-194"},"PeriodicalIF":1.4,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment","authors":"Shun-Fu Hu, Amery D. Wu, Jake Stone","doi":"10.1111/jedm.12424","DOIUrl":"https://doi.org/10.1111/jedm.12424","url":null,"abstract":"<p>Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated with an example of scoring the short version of the College Majors Preference assessment (Short CMPA) to match the results of whether the 50 college majors would be in one's top three, as determined by the Long CMPA. The results reveal that MNN significantly outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"120-144"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143689091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model","authors":"Tong Wu, Stella Y. Kim, Carl Westine, Michelle Boyer","doi":"10.1111/jedm.12425","DOIUrl":"https://doi.org/10.1111/jedm.12425","url":null,"abstract":"<p>While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"145-171"},"PeriodicalIF":1.4,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}