{"title":"Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs.","authors":"Xinran Liu, Daniel McNeish","doi":"10.1177/00131644241290172","DOIUrl":"10.1177/00131644241290172","url":null,"abstract":"<p><p>Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method-dynamic fit index (DFI) cutoffs-to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications significantly reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241290172"},"PeriodicalIF":2.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562945/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Invariance: What Does Measurement Invariance Allow Us to Claim?","authors":"John Protzko","doi":"10.1177/00131644241282982","DOIUrl":"10.1177/00131644241282982","url":null,"abstract":"<p><p>Measurement involves numerous theoretical and empirical steps-ensuring our measures are operating the same in different groups is one step. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption of extending measurement invariance to mean common measurement by randomly assigning American adults (<i>N</i> = 1500) to fill out scales assessing a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing, it may be a necessary but not a sufficient condition.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241282982"},"PeriodicalIF":2.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562939/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Differential Item Functioning Using Response Time.","authors":"Qizhou Duan, Ying Cheng","doi":"10.1177/00131644241280400","DOIUrl":"10.1177/00131644241280400","url":null,"abstract":"<p><p>This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables, and logarithm transformed response times as the dependent variable. Effect size measures such as Δ <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> and percentage change in regression coefficients in conjunction with the statistical significance tests were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) significance test, (b) significance test with Δ <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> , and (c) significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample sizes, proportion of the focal group in relation to total sample size, number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across different conditions; using Δ<i>R</i> <sup>2</sup> with significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241280400"},"PeriodicalIF":2.1,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562889/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl
{"title":"Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations.","authors":"Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl","doi":"10.1177/00131644241271309","DOIUrl":"10.1177/00131644241271309","url":null,"abstract":"<p><p>The speed-accuracy tradeoff (SAT), where increased response speed often leads to decreased accuracy, is well established in experimental psychology. However, its implications for psychological assessments, especially in high-stakes settings, remain less understood. This study presents an experimental approach to investigate the SAT within a high-stakes spatial ability assessment. By manipulating instructions in a within-subjects design to induce speed variations in a large sample (<i>N</i> = 1,305) of applicants for an air traffic controller training program, we demonstrate the feasibility of manipulating working speed. Our findings confirm the presence of the SAT for most participants, suggesting that traditional ability scores may not fully reflect performance in high-stakes assessments. Importantly, we observed individual differences in the SAT, challenging the assumption of uniform SAT functions across test takers. These results highlight the complexity of interpreting high-stakes assessment outcomes and the influence of test conditions on performance dynamics. This study offers a valuable addition to the methodological toolkit for assessing the intraindividual relationship between speed and accuracy in psychological testing (including SAT research), providing a controlled approach while acknowledging the need to address potential confounders. Future research may apply this method across various cognitive domains, populations, and testing contexts to deepen our understanding of the SAT's broader implications for psychological measurement.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241271309"},"PeriodicalIF":2.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562887/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Latent Structure Examination of Behavioral Measuring Instruments in Complex Empirical Settings.","authors":"Tenko Raykov, Khaled Alkherainej","doi":"10.1177/00131644241281049","DOIUrl":"10.1177/00131644241281049","url":null,"abstract":"<p><p>A multiple-step procedure is outlined that can be used for examining the latent structure of behavior measurement instruments in complex empirical settings. The method permits one to study their latent structure after assessing the need to account for clustering effects and the necessity of its examination within individual levels of fixed factors, such as gender or group membership of substantive relevance. The approach is readily applicable with binary or binary-scored items using popular and widely available software. The described procedure is illustrated with empirical data from a student behavior screening instrument.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241281049"},"PeriodicalIF":2.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562891/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretation of the Standardized Mean Difference Effect Size When Distributions Are Not Normal or Homoscedastic.","authors":"Larry V Hedges","doi":"10.1177/00131644241278928","DOIUrl":"10.1177/00131644241278928","url":null,"abstract":"<p><p>The standardized mean difference (sometimes called Cohen's d) is an effect size measure widely used to describe the outcomes of experiments. It is mathematically natural to describe differences between groups of data that are normally distributed with different means but the same standard deviation. In that context, it can be interpreted as determining several indexes of overlap between the two distributions. If the data are not approximately normally distributed or if they have substantially unequal standard deviations, the relation between d and overlap between distributions can be very different, and interpretations of d that apply when the data are normal with equal variances are unreliable.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241278928"},"PeriodicalIF":2.1,"publicationDate":"2024-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562970/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Effort-Moderated Item Response Theory Models by Evaluating a Two-Step Estimation Method and Multidimensional Variations on the Model.","authors":"Bowen Wang, Corinne Huggins-Manley, Huan Kuang, Jiawei Xiong","doi":"10.1177/00131644241280727","DOIUrl":"10.1177/00131644241280727","url":null,"abstract":"<p><p>Rapid-guessing behavior in data can compromise our ability to estimate item and person parameters accurately. Consequently, it is crucial to model data with rapid-guessing patterns in a way that can produce unbiased ability estimates. This study proposes and evaluates three alternative modeling approaches that follow the logic of the effort-moderated item response theory model (EM-IRT) to analyze response data with rapid-guessing responses. One is the two-step EM-IRT model, which utilizes the item parameters estimated by respondents without rapid-guessing behavior and was initially proposed by Rios and Soland without further investigation. The other two models are effort-moderated multidimensional models (EM-MIRT), which we introduce in this study and vary as both between-item and within-item structures. The advantage of the EM-MIRT model is to account for the underlying relationship between rapid-guessing propensity and ability. The three models were compared with the traditional EM-IRT model regarding the accuracy of parameter recovery in various simulated conditions. Results demonstrated that the two-step EM-IRT and between-item EM-MIRT model consistently outperformed the traditional EM-IRT model under various conditions, with the two-step EM-IRT estimation generally delivering the best performance, especially for ability and item difficulty parameters estimation. In addition, different rapid-guessing patterns (i.e., difficulty-based, changing state, and decreasing effort) did not affect the performance of the two-step EM-IRT model. Overall, the findings suggest that the EM-IRT model with the two-step parameter estimation method can be applied in practice for estimating ability in the presence of rapid-guessing responses due to its accuracy and efficiency. The between-item EM-MIRT model can be used as an alternative model when there is no significant mean difference in the ability estimates between examinees who exhibit rapid-guessing behavior and those who do not.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241280727"},"PeriodicalIF":2.1,"publicationDate":"2024-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.","authors":"Hotaka Maeda","doi":"10.1177/00131644241281053","DOIUrl":"10.1177/00131644241281053","url":null,"abstract":"<p><p>Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct, item discrimination, conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on Github: https://github.com/hotakamaeda/ai_field_testing1.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241281053"},"PeriodicalIF":2.1,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562880/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Precision in Predicting Magnitude of Differential Item Functioning: An M-DIF Pretrained Model Approach.","authors":"Shan Huang, Hidetoki Ishii","doi":"10.1177/00131644241279882","DOIUrl":"10.1177/00131644241279882","url":null,"abstract":"<p><p>Despite numerous studies on the magnitude of differential item functioning (DIF), different DIF detection methods often define effect sizes inconsistently and fail to adequately account for testing conditions. To address these limitations, this study introduces the unified M-DIF model, which defines the magnitude of DIF as the difference in item difficulty parameters between reference and focal groups. The M-DIF model can incorporate various DIF detection methods and test conditions to form a quantitative model. The pretrained approach was employed to leverage a sufficiently representative large sample as the training set and ensure the model's generalizability. Once the pretrained model is constructed, it can be directly applied to new data. Specifically, a training dataset comprising 144 combinations of test conditions and 144,000 potential DIF items, each equipped with 29 statistical metrics, was used. We adopt the XGBoost method for modeling. Results show that, based on root mean square error (RMSE) and BIAS metrics, the M-DIF model outperforms the baseline model in both validation sets: under consistent and inconsistent test conditions. Across all 360 combinations of test conditions (144 consistent and 216 inconsistent with the training set), the M-DIF model demonstrates lower RMSE in 357 cases (99.2%), illustrating its robustness. Finally, we provided an empirical example to showcase the practical feasibility of implementing the M-DIF model.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241279882"},"PeriodicalIF":2.1,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562883/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142647676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using ROC Analysis to Refine Cut Scores Following a Standard Setting Process.","authors":"Dongwei Wang, Lisa A Keller","doi":"10.1177/00131644241278925","DOIUrl":"10.1177/00131644241278925","url":null,"abstract":"<p><p>In educational assessment, cut scores are often defined through standard setting by a group of subject matter experts. This study aims to investigate the impact of several factors on classification accuracy using the receiver operating characteristic (ROC) analysis to provide statistical and theoretical evidence when the cut score needs to be refined. Factors examined in the study include the sample distribution relative to the cut score, prevalence of the positive event, and cost ratio. Forty item responses were simulated for examinees of four sample distributions. In addition, the prevalence and cost ratio between false negatives and false positives were manipulated to examine their impacts on classification accuracy. The optimal cut score is identified using the Youden Index <i>J</i>. The results showed that the optimal cut score identified by the evaluation criterion tended to pull the cut score closer to the mode of the proficiency distribution. In addition, depending on the prevalence of the positive event and cost ratio, the optimal cut score shifts accordingly. With the item parameters used to simulate the data and the simulated sample distributions, it was found that when passing the exam is a low-prevalence event in the population, increasing the cut score operationally improves the classification; when passing the exam is a high-prevalence event, then cut score should be reduced to achieve optimality. As the cost ratio increases, the optimal cut score suggested by the evaluation criterion decreases. In three out of the four sample distributions examined in this study, increasing the cut score enhanced the classification, irrespective of the cost ratio when the prevalence in the population is 50%. This study provides statistical evidence when the cut score needs to be refined for policy reasons.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241278925"},"PeriodicalIF":2.1,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562877/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142650503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}