{"title":"The One-Parameter Logistic Model Can Be True With Zero Probability for a Unidimensional Measuring Instrument: How One Could Go Wrong Removing Items Not Satisfying the Model.","authors":"Tenko Raykov, Bingsheng Zhang","doi":"10.1177/00131644251345120","DOIUrl":"10.1177/00131644251345120","url":null,"abstract":"<p><p>This note is concerned with the chance of the one-parameter logistic (1PL-) model or the Rasch model being true for a unidimensional multi-item measuring instrument. It is pointed out that if a single dimension underlies a scale consisting of dichotomous items, then the probability of either model being correct for that scale can be zero. The question is then addressed, what the consequences could be of removing items not following these models. Using a large number of simulated data sets, a pair of empirically relevant settings is presented where such item elimination can be problematic. Specifically, dropping items from a unidimensional instrument due to them not satisfying the 1PL-model, or the Rasch model, can yield potentially seriously misleading ability estimates with increased standard errors and prediction error with respect to the latent trait. Implications for educational and behavioral research are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251345120"},"PeriodicalIF":2.3,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12328337/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144816062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-Based Person Fit Statistics Applied to the Wechsler Adult Intelligence Scale IV.","authors":"Jared M Block, Steven P Reise, Keith F Widaman, Amanda K Montoya, David W Loring, Laura Glass Umfleet, Russell M Bauer, Joseph M Gullett, Brittany Wolff, Daniel L Drane, Kristen Enriquez, Robert M Bilder","doi":"10.1177/00131644251339444","DOIUrl":"10.1177/00131644251339444","url":null,"abstract":"<p><p>An important task in clinical neuropsychology is to evaluate whether scores obtained on a test battery, such as the Wechsler Adult Intelligence Scale Fourth Edition (WAIS-IV), can be considered \"credible\" or \"valid\" for a particular patient. Such evaluations are typically made based on responses to performance validity tests (PVTs). As a complement to PVTs, we propose that WAIS-IV profiles also be evaluated using a residual-based M-distance ( <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> ) person fit statistic. Large <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> values flag profiles that are inconsistent with the factor analytic model underlying the interpretation of test scores. We first established a well-fitting model with four correlated factors for 10 core WAIS-IV subtests derived from the standardization sample. Based on this model, we then performed a Monte Carlo simulation to evaluate whether a hypothesized sampling distribution for <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> was accurate and whether <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> was computable, under different degrees of missing subtest scores. We found that when the number of subtests administered was less than 8, <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> could not be computed around 25% of the time. When computable, <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> conformed to a <math> <mrow> <msup><mrow><mi>χ</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> distribution with degrees of freedom equal to the number of tests minus the number of factors. Demonstration of the <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> index in a large sample of clinical cases was also provided. 
Findings highlight the potential utility of the <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> index as an adjunct to PVTs, offering clinicians an additional method to evaluate WAIS-IV test profiles and improve the accuracy of neuropsychological evaluations.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251339444"},"PeriodicalIF":2.3,"publicationDate":"2025-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12321812/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144793789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
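
The abstract does not spell out the computation, but a residual-based M-distance for a single profile under a fitted common factor model can be sketched roughly as follows; the regression factor-score estimator, the residual weight matrix, and all variable names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def residual_m_distance(x, Lambda, Psi, Phi=None):
    """Illustrative residual-based M-distance person-fit statistic.

    x      : (p,) one examinee's standardized subtest scores
    Lambda : (p, m) factor loadings from a previously fitted model
    Psi    : (p, p) diagonal matrix of unique variances
    Phi    : (m, m) factor correlation matrix (identity if None)
    Returns the distance and a p-value from a chi-square with p - m df,
    the reference distribution reported in the abstract.
    """
    p, m = Lambda.shape
    Phi = np.eye(m) if Phi is None else Phi
    Sigma = Lambda @ Phi @ Lambda.T + Psi               # model-implied covariance
    f_hat = Phi @ Lambda.T @ np.linalg.solve(Sigma, x)  # regression factor scores
    e = x - Lambda @ f_hat                              # profile residuals
    d2 = float(e @ np.linalg.solve(Psi, e))             # Mahalanobis-type distance
    return d2, stats.chi2.sf(d2, df=p - m)
```

Whether the inverse unique-variance matrix is the appropriate weight depends on the factor-score estimator used, so this is best read as a sketch of the general idea rather than a reproduction of the published statistic.
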
{"title":"Disentangling Qualitatively Different Faking Strategies in High-Stakes Personality Assessments: A Mixture Extension of the Multidimensional Nominal Response Model.","authors":"Timo Seitz, Ö Emre C Alagöz, Thorsten Meiser","doi":"10.1177/00131644251341843","DOIUrl":"10.1177/00131644251341843","url":null,"abstract":"<p><p>High-stakes personality assessments are often compromised by faking, where test-takers distort their responses according to social desirability. Many previous models have accounted for faking by modeling an additional latent dimension that quantifies each test-taker's degree of faking. Such models assume a homogeneous response strategy among all test-takers, reflected in a measurement model in which substantive traits and faking jointly influence item responses. However, such a model will be misspecified if, for some test-takers, item responding is only a function of substantive traits or only a function of faking. To address this limitation, we propose a mixture modeling extension of the multidimensional nominal response model (M-MNRM) that can be used to account for qualitatively different response strategies and to model relationships of strategy use with external variables. In a simulation study, the M-MNRM exhibited good parameter recovery and high classification accuracy across multiple conditions. Analyses of three empirical high-stakes datasets provided evidence for the consistent presence of the specified latent classes in different personnel selection contexts, emphasizing the importance of accounting for such kind of response behavior heterogeneity in high-stakes assessment data. We end the article with a discussion of the model's utility for psychological measurement.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251341843"},"PeriodicalIF":2.3,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12310618/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144774941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Item Difficulty Modeling Using Fine-tuned Small and Large Language Models.","authors":"Ming Li, Hong Jiao, Tianyi Zhou, Nan Zhang, Sydney Peters, Robert W Lissitz","doi":"10.1177/00131644251344973","DOIUrl":"10.1177/00131644251344973","url":null,"abstract":"<p><p>This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models (LLMs). We introduce novel data augmentation strategies, including augmentation on the fly and distribution balancing, that surpass benchmark performances, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results showed that fine-tuned small language models (SLMs) such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa yielded lower root mean squared error than the first-place model in the BEA 2024 Shared Task competition, whereas domain-specific models like BioClinicalBERT and PubMedBERT did not provide significant improvements due to distributional gaps. Majority voting among SLMs enhanced prediction accuracy, reinforcing the benefits of ensemble learning. LLMs, such as GPT-4, exhibited strong generalization capabilities but struggled with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251344973"},"PeriodicalIF":2.1,"publicationDate":"2025-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12230038/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144590702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Historical Measurement Information Can Be Used to Improve Estimation of Structural Parameters in Structural Equation Models With Small Samples.","authors":"James Ohisei Uanhoro, Olushola O Soyoye","doi":"10.1177/00131644251330851","DOIUrl":"10.1177/00131644251330851","url":null,"abstract":"<p><p>This study investigates the incorporation of historical measurement information into structural equation models (SEM) with small samples to enhance the estimation of structural parameters. Given the availability of published factor analysis results with loading estimates and standard errors for popular scales, researchers may use this historical information as informative priors in Bayesian SEM (BSEM). We focus on estimating the correlation between two constructs using BSEM after generating data with significant bias in the Pearson correlation of their sum scores due to measurement error. Our findings indicate that incorporating historical information on measurement parameters as priors can improve the accuracy of correlation estimates, mainly when the true correlation is small-a common scenario in psychological research. Priors derived from meta-analytic estimates were especially effective, providing high accuracy and acceptable coverage. However, when the true correlation is large, weakly informative priors on all parameters yield the best results. These results suggest leveraging historical measurement information in BSEM can enhance structural parameter estimation.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251330851"},"PeriodicalIF":2.1,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12170579/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144324766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Performance of a Regularized Differential Item Functioning Method for Testlet-Based Polytomous Items.","authors":"Jing Huang, M David Miller, Anne Corinne Huggins-Manley, Walter L Leite, Herman T Knopf, Albert D Ritzhaupt","doi":"10.1177/00131644251342512","DOIUrl":"10.1177/00131644251342512","url":null,"abstract":"<p><p>This study investigated the effect of testlets on regularization-based differential item functioning (DIF) detection in polytomous items, focusing on the generalized partial credit model with lasso penalization (GPCMlasso) DIF method. Five factors were manipulated: sample size, magnitude of testlet effect, magnitude of DIF, number of DIF items, and type of DIF-inducing covariates. Model performance was evaluated using false-positive rate (FPR) and true-positive rate (TPR). Results showed that the simulation had effective control of FPR across conditions, while the TPR was differentially influenced by the manipulated factors. Generally, the small testlet effect did not noticeably affect the GPCMlasso model's performance regarding FPR and TPR. The findings provide evidence of the effectiveness of the GPCMlasso method for DIF detection in polytomous items when testlets were present. The implications for future research and limitations were also discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251342512"},"PeriodicalIF":2.1,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12126468/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144207999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beta-Binomial Model for Count Data: An Application in Estimating Model-Based Oral Reading Fluency.","authors":"Xin Qiao, Akihito Kamata, Yusuf Kara, Cornelis Potgieter, Joseph F T Nese","doi":"10.1177/00131644251335914","DOIUrl":"10.1177/00131644251335914","url":null,"abstract":"<p><p>In this article, the beta-binomial model for count data is proposed and demonstrated in terms of its application in the context of oral reading fluency (ORF) assessment, where the number of words read correctly (WRC) is of interest. Existing studies adopted the binomial model for count data in similar assessment scenarios. The beta-binomial model, however, takes into account extra variability in count data that have been neglected by the binomial model. Therefore, it accommodates potential overdispersion in count data compared to the binomial model. To estimate model-based ORF scores, WRC and response times were jointly modeled. The full Bayesian Markov chain Monte Carlo method was adopted for model parameter estimation. A simulation study showed adequate parameter recovery of the beta-binomial model and evaluated the performance of model fit indices in selecting the true data-generating models. Further, an empirical analysis illustrated the application of the proposed model using a dataset from a computerized ORF assessment. The obtained findings were consistent with the simulation study and demonstrated the utility of adopting the beta-binomial model for count-type item responses from assessment data.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251335914"},"PeriodicalIF":2.1,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12125017/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144198554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian Thurstonian IRT Modeling: Logical Dependencies as an Accurate Reflection of Thurstone's Law of Comparative Judgment.","authors":"Hannah Heister, Philipp Doebler, Susanne Frick","doi":"10.1177/00131644251335586","DOIUrl":"10.1177/00131644251335586","url":null,"abstract":"<p><p>Thurstonian item response theory (Thurstonian IRT) is a well-established approach to latent trait estimation with forced choice data of arbitrary block lengths. In the forced choice format, test takers rank statements within each block. This rank is coded with binary variables. Since each rank is awarded exactly once per block, stochastic dependencies arise, for example, when options A and B have ranks 1 and 3, C must have rank 2 in a block of length 3. Although the original implementation of the Thurstonian IRT model can recover parameters well, it is not completely true to the mathematical model and Thurstone's law of comparative judgment, as impossible binary answer patterns have a positive probability. We refer to this problem as stochastic dependencies and it is due to unconstrained item intercepts. In addition, there are redundant binary comparisons resulting in what we call logical dependencies, for example, if within a block <math><mrow><mi>A</mi> <mo><</mo> <mi>B</mi></mrow> </math> and <math><mrow><mi>B</mi> <mo><</mo> <mi>C</mi></mrow> </math> , then <math><mrow><mi>A</mi> <mo><</mo> <mi>C</mi></mrow> </math> must follow and a binary variable for <math><mrow><mi>A</mi> <mo><</mo> <mi>C</mi></mrow> </math> is not needed. Since current Markov Chain Monte Carlo approaches to Bayesian computation are flexible and at the same time promise correct small sample inference, we investigate an alternative Bayesian implementation of the Thurstonian IRT model considering both stochastic and logical dependencies. We show analytically that the same parameters maximize the posterior likelihood, regardless of the presence or absence of redundant binary comparisons. A comparative simulation reveals a large reduction in computational effort for the alternative implementation, which is due to respecting both dependencies. Therefore, this investigation suggests that when fitting the Thurstonian IRT model, all dependencies should be considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251335586"},"PeriodicalIF":2.1,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12125010/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144198553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Biclustering to Detect Cheating in Real Time on Mixed-Format Tests.","authors":"Hyeryung Lee, Walter P Vispoel","doi":"10.1177/00131644251333143","DOIUrl":"10.1177/00131644251333143","url":null,"abstract":"<p><p>We evaluated a real-time biclustering method for detecting cheating on mixed-format assessments that included dichotomous, polytomous, and multi-part items. Biclustering jointly groups examinees and items by identifying subgroups of test takers who exhibit similar response patterns on specific subsets of items. This method's flexibility and minimal assumptions about examinee behavior make it computationally efficient and highly adaptable. To further finetune accuracy and reduce false positives in real-time detection, enhanced statistical significance tests were incorporated into the illustrated algorithms. Two simulation studies were conducted to assess detection across varying testing conditions. In the first study, the method effectively detected cheating on tests composed entirely of either dichotomous or non-dichotomous items. In the second study, we examined tests with varying mixed item formats and again observed strong detection performance. In both studies, detection performance was examined at each timestamp in real time and evaluated under three varying conditions: proportion of cheaters, cheating group size, and proportion of compromised items. Across conditions, the method demonstrated strong computational efficiency, underscoring its suitability for real-time applications. Overall, these results highlight the adaptability, versatility, and effectiveness of biclustering in detecting cheating in real time while maintaining low false-positive rates.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251333143"},"PeriodicalIF":2.1,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12104213/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144156794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Deep Reinforcement Learning to Decide Test Length.","authors":"James Zoucha, Igor Himelfarb, Nai-En Tang","doi":"10.1177/00131644251332972","DOIUrl":"https://doi.org/10.1177/00131644251332972","url":null,"abstract":"<p><p>This study explored the application of deep reinforcement learning (DRL) as an innovative approach to optimize test length. The primary focus was to evaluate whether the current length of the National Board of Chiropractic Examiners Part I Exam is justified. By modeling the problem as a combinatorial optimization task within a Markov Decision Process framework, an algorithm capable of constructing test forms from a finite set of items while adhering to critical structural constraints, such as content representation and item difficulty distribution, was used. The findings reveal that although the DRL algorithm was successful in identifying shorter test forms that maintained comparable ability estimation accuracy, the existing test length of 240 items remains advisable as we found shorter test forms did not maintain structural constraints. Furthermore, the study highlighted the inherent adaptability of DRL to continuously learn about a test-taker's latent abilities and dynamically adjust to their response patterns, making it well-suited for personalized testing environments. This dynamic capability supports real-time decision-making in item selection, improving both efficiency and precision in ability estimation. Future research is encouraged to focus on expanding the item bank and leveraging advanced computational resources to enhance the algorithm's search capacity for shorter, structurally compliant test forms.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251332972"},"PeriodicalIF":2.1,"publicationDate":"2025-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12049363/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143988676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}