{"title":"Assessing the Unconditional and Conditional External Validity of Noncognitive Test Scores: A Unifying Model-Based Proposal.","authors":"Pere J Ferrando, Fabia Morales-Vives, Silvia Duran-Bonavila, David Navarro-González","doi":"10.1177/00131644261440168","DOIUrl":"https://doi.org/10.1177/00131644261440168","url":null,"abstract":"<p><p>Evidence of external validity based on individual score estimates is still relevant in many psychometric applications. From a model-based perspective, however, the topic appears to have been rather neglected in recent decades. Thus, in structural equation modelling (SEM), this evidence is sought to be obtained structurally, bypassing the scoring stage. And, in item response theory (IRT), the score interest mostly focuses on internal properties. Taking this state of affairs into account, this paper develops and proposes a model-based approach, intended for noncognitive measures, that combines SEM and IRT developments, and which allows a detailed assessment of the external validity of a class of score estimates to be carried out. The starting point is a general extended model that also includes the relevant external variables. From this general model, four well-known extended IRT models can be derived and fitted at the structural level. Next, on the basis of the structural results, a series of unconditional (population-dependent) and conditional (population-independent) indices that describe the model-implied relation between the score estimates and each external variable are developed and proposed. The practical relevance of the proposal is discussed mainly around three applications: assessing model appropriateness, obtaining point and interval prediction estimates at the individual level, and shortening a test while optimizing the external validity of the resulting version. The functioning of the proposal is illustrated using a real-data example.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261440168"},"PeriodicalIF":2.3,"publicationDate":"2026-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13133035/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147812434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiplicity Control for Structural Equation Modeling in <i>lavaan</i>: A Practical Workflow for False Discovery Rate Adjustment.","authors":"Giuseppe Corbelli","doi":"10.1177/00131644261442141","DOIUrl":"https://doi.org/10.1177/00131644261442141","url":null,"abstract":"<p><p>Structural equation modeling (SEM) is widely used in educational and behavioral research, but applied SEM often involves simultaneous tests of many structural paths. When many coefficients are evaluated at nominal thresholds, the probability of false positives and the expected number of false discoveries can be substantial even when global fit indices indicate close fit, encouraging substantive interpretation of chance findings. Building on prior work on multiplicity control in SEM, this article presents a practical workflow for false discovery rate (FDR) adjustment of families of SEM parameter tests obtained from fitted <i>lavaan</i> model objects, including the dependence-robust Benjamini-Yekutieli (BY) procedure, and provides an R implementation to support routine use. In a Monte Carlo study (1,000 replications; <i>N</i> = 500) with nine latent factors, a correctly specified measurement model, and an overspecified structural model with 33 candidate regressions (8 non-zero), nominal <i>p</i> < .05 produced at least one false positive in 69.3% of samples and a mean of 1.182 false-positive paths. BY adjustment reduced the mean number of false positives to 0.073, while the mean number of detected true effects declined from 6.358 to 5.857. A sensitivity analysis across three dependency conditions indicated that BY-FDR was more robust to the direction and magnitude of parameter dependence, whereas BH's false-positive control weakened under negative dependence. These results suggest that dependence-robust FDR adjustment can be integrated into a standard SEM workflow with <i>lavaan</i> in R, and may substantially reduce false positives with a modest reduction in detected true effects.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261442141"},"PeriodicalIF":2.3,"publicationDate":"2026-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13124904/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147812479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling the False Discovery Rate in DIF Detection With e-Values: Evidence From Multidimensional and Testlet Simulations.","authors":"Shan Huang, David Goretzko","doi":"10.1177/00131644261433236","DOIUrl":"10.1177/00131644261433236","url":null,"abstract":"<p><p>This study presents the first application of e-value-based false discovery rate (FDR) control to Differential Item Functioning (DIF) detection, addressing long-standing limitations of <i>p</i>-value-based approaches when model assumptions are violated-for example, under multidimensionality, local item dependence, or extreme sample sizes. Two comprehensive simulation studies were conducted to evaluate e-BH (the e-value analogue of BH) procedures, using K-fold and Multisplit likelihood-ratio e-values, under (a) multidimensional contamination and (b) testlet-based local dependence. Across both scenarios, e-BH consistently provided stronger and more stable control of Type I error, FDR, and family-wise error rate (FWER) than classical procedures such as Benjamini-Hochberg (BH) and Holm. Even under severe model misspecification, e-BH maintained substantially lower false-positive rates while remaining relatively competitive in terms of Type II error. A key finding concerns sample size: classical <i>p</i>-value methods exhibited inflation of Type I error as N increased, whereas e-BH preserved stable error control due to its model-agnostic calibration. An empirical application using Progress in International Reading Literacy Study (PIRLS) data further demonstrated that e-BH produces a more defensible and operationally sustainable set of DIF flags than traditional approaches. Together, these results establish e-values as a powerful and robust evidential tool for DIF detection in modern assessment contexts.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261433236"},"PeriodicalIF":2.3,"publicationDate":"2026-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13086766/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147722237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating Ensemble Clustering and Text Embeddings for Estimating the Factor Loadings of Self-Report Scales.","authors":"Nathaniel M Voss, Felix Y Wu, Anoop A Javalagi, Harrison J Kell","doi":"10.1177/00131644261430762","DOIUrl":"https://doi.org/10.1177/00131644261430762","url":null,"abstract":"<p><p>Advances in large language models can provide opportunities to evaluate the characteristics of scales prior to data collection. In this study, we explore if item text can be used to predict a scale's psychometric properties. Specifically, we examine if clustering consensus (i.e., the frequency by which items are grouped with other items from the same underlying factor across multiple clustering algorithms), and a cosine similarity metric (i.e., the semantic similarity of items to other items from the same factor), can be used to predict exploratory factor analysis (EFA) factor loadings. Across six scales with varying sample sizes, number of factors/items, we found that both the cosine similarity and ensemble clustering consensus methods predicted factor loading values. While the methods share some conceptual and empirical overlap, and results vary by scale, the ensemble clustering approach explains incremental variance above and beyond cosine similarity in predicting factor loadings. Using both methods in conjunction can be a useful way to identify problematic items prior to data collection and help researchers develop more optimal scales from the onset, thereby potentially saving time, resources, and increasing the likelihood of developing sound measures.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261430762"},"PeriodicalIF":2.3,"publicationDate":"2026-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13076461/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147688791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How Extreme Is It Anyways?: An Empirical Investigation Into the Prevalence and Strength of Extreme Response Style.","authors":"Martijn Schoenmakers, Jesper Tijmstra, Jeroen Kornelis Vermunt, Maria Bolsinova","doi":"10.1177/00131644261435119","DOIUrl":"https://doi.org/10.1177/00131644261435119","url":null,"abstract":"<p><p>Extreme response style (ERS), the tendency of participants to endorse the extreme categories of an item partially independent of item content, has repeatedly been found to decrease the validity of Likert-type scale results. For this reason, many IRT models have been developed that attempt to detect and correct for ERS. Despite the substantive literature on ERS and modeling of ERS, several important questions remain. To date, there is no clear estimate of how often ERS occurs in practice across a variety of scales and populations. In addition, there is little guidance on what item parameters for ERS models are commonly found in empirical data, while this information is crucial to inform future methodological studies utilizing ERS models. Finally, there is only limited information available on which ERS models tend to fit the data best. The current study sets out to address these three issues by analyzing data from the Programme for International Student Assessment using a generalized partial credit model, several multidimensional nominal response models, and several IRTree models. Results indicate an extremely high prevalence of ERS across scales, populations, and timepoints. Item parameters for future methodological studies are presented, and a general preference for IRTree models over MNRM models is found in many datasets. Implications for futures studies are discussed, and recommendations for practice are made.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261435119"},"PeriodicalIF":2.3,"publicationDate":"2026-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13068779/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147671374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faking in High-Stakes Personality Assessments: A Response-Time-Based Latent Response Mixture Modeling Approach.","authors":"Timo Seitz, Esther Ulitzsch","doi":"10.1177/00131644261422169","DOIUrl":"10.1177/00131644261422169","url":null,"abstract":"<p><p>When personality assessments are employed in high-stakes contexts, there is the risk that test-takers provide overly positive descriptions of themselves. This response bias is known as faking and has often been addressed in latent variable models through an additional dimension capturing each test-taker's faking degree. Such models typically assume a homogeneous response strategy for all test-takers, with substantive traits and faking jointly influencing responses to all items. In this article, we present a latent response mixture item response theory (IRT) model of faking that accounts for changes in test-takers' response strategies over the course of the assessment. The model translates theoretical considerations about test-taker behavior into different model components for item responses and corresponding item-level response times (RT), thereby allowing to account for, identify, and investigate different faking-related response strategies on the person-by-item level. In a parameter recovery study, we found that the model parameters can be estimated well under realistic conditions. Also, we applied the model to an empirical dataset (<i>N</i> = 1,824) from a job application context, showcasing its utility in real high-stakes assessment data. We conclude the article by discussing the role of the model for psychological measurement as well as substantive research.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261422169"},"PeriodicalIF":2.3,"publicationDate":"2026-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12999537/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147497889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conditional Dependencies Between Response Time and Item Discrimination: An Item-Level Meta-Analysis.","authors":"Joshua B Gilbert, William S Young, Zachary Himmelsbach, Esther Ulitzsch, Benjamin W Domingue","doi":"10.1177/00131644261426972","DOIUrl":"10.1177/00131644261426972","url":null,"abstract":"<p><p>The use of process data, such as response time (RT) in psychometrics, has generally focused on the relationship between speed and accuracy. The potential relationships between RT and item discrimination remain less explored. In this study, we propose a model for simultaneously estimating the relationships between RT and item discrimination at the person, item, and person-by-item (residual) levels and illustrate our approach through an item-level meta-analysis of 40 empirical data sets comprising 1.84 million item responses. We find no evidence of average differences in item discrimination between items of different time intensity or persons of different average RT, while residual RT strongly and negatively predicts item discrimination (pooled coef. = -.27% per 1% difference in RT, <i>SE</i> = .04, <math><mrow><mi>τ</mi></mrow> </math> = .17). While heterogeneity is high, we find little evidence of moderation by overall data set characteristics. Flexible generalized additive models show that the relationship between residual RT and item discrimination is generally curvilinear, with discrimination maximized just below average RT and minimized at the extremes. Our results suggest that RT data can provide insights into the measurement properties of educational and psychological assessments, but that the relationships between RT and item discrimination are highly variable.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261426972"},"PeriodicalIF":2.3,"publicationDate":"2026-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12995739/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147485016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Trends With Differential Item Functioning: A Comparison of Five IRT-Based Approaches.","authors":"Oskar Engels, Oliver Lüdtke, Alexander Robitzsch","doi":"10.1177/00131644251408818","DOIUrl":"https://doi.org/10.1177/00131644251408818","url":null,"abstract":"<p><p>In longitudinal assessments, tests are frequently used to estimate trends over time. However, when item parameters lack invariance, time-point comparisons can be distorted, necessitating appropriate statistical methods to achieve accurate estimation. This study compares trend estimates using the two-parameter logistic (2PL) model under item parameter drift (IPD) across five trend-estimation approaches for two time points: First, concurrent calibration, which jointly estimates item parameters across multiple time points. Second, fixed calibration, which estimates item parameters at a single time point and fixes them at the other time point. Third, robust linking with Haberman and Haebara as linking methods with <math> <mrow> <msub><mrow><mi>L</mi></mrow> <mrow><mi>p</mi></mrow> </msub> </mrow> </math> or <math> <mrow> <msub><mrow><mi>L</mi></mrow> <mrow><mn>0</mn></mrow> </msub> </mrow> </math> losses. Fourth, non-invariant items are detected using likelihood-ratio tests or the root mean square deviation statistic with fixed or data-driven cutoffs, and trend estimates are then recomputed using only the detected invariant items under partial invariance. Fifth, regularized estimation under a smooth Bayesian information criterion (SBIC) is applied, shrinking small or null IPD effects toward zero while estimating all others as nonzero. Bias and relative root mean square error (RMSE) were evaluated for the mean and <i>SD</i> at T2. An empirical example using synthetic longitudinal reading data, applying the trend-estimation approaches, is provided. The results indicate that the regularized estimation with SBIC performed best across conditions, maintaining low bias and RMSE, followed by robust linking methods. Specifically, Haberman linking with the <math> <mrow> <msub><mrow><mi>L</mi></mrow> <mrow><mn>0</mn></mrow> </msub> </mrow> </math> loss function showed superior performance under unbalanced IPD, outperforming the partial invariance approaches. Concurrent and fixed calibration showed the poorest trend recovery under unbalanced IPD conditions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251408818"},"PeriodicalIF":2.3,"publicationDate":"2026-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12987755/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147462744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminating Between Attribute, Item-Position, and Wording Effects by the Congeneric and Tau-Equivalent Confirmatory Factor Analysis Models.","authors":"Karl Schweizer, Xuezhu Ren, Tengfei Wang","doi":"10.1177/00131644261419028","DOIUrl":"https://doi.org/10.1177/00131644261419028","url":null,"abstract":"<p><p>The capability of confirmatory factor analysis to discriminate common systematic variation of attribute, item-position, and wording effects was investigated using the congeneric and tau-equivalent models. The simulated data generated according to four approaches included gradually increased amounts of item-position or wording effect variation while the amount of attribute variation was kept constant. The congeneric model always signified good model fit independently of the type and amount of additional common systematic variation, that is, there was no discrimination. In applications of the tau-equivalent model, the increase of the item-position or wording effect variation led to the change from indicating good fit to bad model fit, that is, there was negative discrimination. In contrast, the additionally considered two-factor tau model discriminated positively. As a consequence of these results, we recommend the pre-screening of data for method effects.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261419028"},"PeriodicalIF":2.3,"publicationDate":"2026-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12979218/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147462748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation of Conditional Standard Errors of Measurement for MLE Scores in MST.","authors":"Yuanyuan J Stirn, Won-Chan Lee","doi":"10.1177/00131644261420391","DOIUrl":"10.1177/00131644261420391","url":null,"abstract":"<p><p>This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across the same four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644261420391"},"PeriodicalIF":2.3,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12945742/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147324610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}