{"title":"Linear Factor Analytic Thurstonian Forced-Choice Models: Current Status and Issues","authors":"Markus T. Jansen, Ralf Schulze","doi":"10.1177/00131644231205011","DOIUrl":"https://doi.org/10.1177/00131644231205011","url":null,"abstract":"Thurstonian forced-choice modeling is considered to be a powerful new tool to estimate item and person parameters while simultaneously testing the model fit. This assessment approach is associated with the aim of reducing faking and other response tendencies that plague traditional self-report trait assessments. As a result of major recent methodological developments, the estimation of normative trait scores has become possible in addition to the computation of only ipsative scores. This opened up the important possibility of comparisons between individuals with forced-choice assessment procedures. With item response theory (IRT) methods, a multidimensional forced-choice (MFC) format has also been proposed to estimate individual scores. Customarily, items to assess different traits are presented in blocks, often triplets, in applications of the MFC, which is an efficient form of item presentation but also a simplification of the original models. The present study provides a comprehensive review of the present status of Thurstonian forced-choice models and their variants. Critical features of the current models, especially the block models, are identified and discussed. It is concluded that MFC modeling with item blocks is highly problematic and yields biased results. In particular, the often-recommended presentation of blocks with items that are keyed in different directions of a trait proves to be counterproductive considering the goal to reduce response tendencies. The consequences and implications of the highlighted issues are further discussed.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136069087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thinking About Sum Scores Yet Again, Maybe the Last Time, We Don’t Know, Oh No . . .: A Comment on","authors":"Keith F. Widaman, William Revelle","doi":"10.1177/00131644231205310","DOIUrl":"https://doi.org/10.1177/00131644231205310","url":null,"abstract":"The relative advantages and disadvantages of sum scores and estimated factor scores are issues of concern for substantive research in psychology. Recently, while championing estimated factor scores over sum scores, McNeish offered a trenchant rejoinder to an article by Widaman and Revelle, which had critiqued an earlier paper by McNeish and Wolf. In the recent contribution, McNeish misrepresented a number of claims by Widaman and Revelle, rendering moot his criticisms of Widaman and Revelle. Notably, McNeish chose to avoid confronting a key strength of sum scores stressed by Widaman and Revelle—the greater comparability of results across studies if sum scores are used. Instead, McNeish pivoted to present a host of simulation studies to identify relative strengths of estimated factor scores. Here, we review our prior claims and, in the process, deflect purported criticisms by McNeish. We discuss briefly issues related to simulated data and empirical data that provide evidence of strengths of each type of score. In doing so, we identified a second strength of sum scores: superior cross-validation of results across independent samples of empirical data, at least for samples of moderate size. We close with consideration of four general issues concerning sum scores and estimated factor scores that highlight the contrasts between positions offered by McNeish and by us, issues of importance when pursuing applied research in our field.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135859120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Person-Fit Indices to Detect Social Desirability Bias.","authors":"Sanaz Nazari, Walter L Leite, A Corinne Huggins-Manley","doi":"10.1177/00131644221129577","DOIUrl":"10.1177/00131644221129577","url":null,"abstract":"<p><p>Social desirability bias (SDB) has been a major concern in educational and psychological assessments when measuring latent variables because it has the potential to introduce measurement error and bias in assessments. Person-fit indices can detect bias in the form of misfitted response vectors. The objective of this study was to compare the performance of 14 person-fit indices to identify SDB in simulated responses. The area under the curve (AUC) of receiver operating characteristic (ROC) curve analysis was computed to evaluate the predictive power of these statistics. The findings showed that the agreement statistic <math><mrow><mo>(</mo><mi>A</mi><mo>)</mo></mrow></math> outperformed all other person-fit indices, while the disagreement statistic <math><mrow><mo>(</mo><mi>D</mi><mo>)</mo></mrow></math>, dependability statistic <math><mrow><mo>(</mo><mi>E</mi><mo>)</mo></mrow></math>, and the number of Guttman errors <math><mrow><mo>(</mo><mi>G</mi><mo>)</mo></mrow></math> also demonstrated high AUCs to detect SDB. Recommendations for practitioners to use these fit indices are provided.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"907-928"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470160/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10208755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Rating Scale Malfunctioning With the Partial Credit Model and Generalized Partial Credit Model.","authors":"Stefanie A Wind","doi":"10.1177/00131644221116292","DOIUrl":"10.1177/00131644221116292","url":null,"abstract":"<p><p>Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean \"less\" of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items. Although researchers have used these techniques in a variety of contexts, studies are limited that systematically explore their sensitivity to problematic rating scale characteristics (i.e., \"rating scale malfunctioning\"). I used a real data analysis and a simulation study to systematically explore the sensitivity of rating scale analysis techniques based on two popular polytomous item response theory (IRT) models: the partial credit model (PCM) and the generalized partial credit model (GPCM). Overall, results indicated that both models provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. Implications for research and practice are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"953-983"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470161/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Equidistant Response Options on Likert-Type Instruments: Testing the Interval Scaling Assumption Using Mplus.","authors":"Georgios Sideridis, Ioannis Tsaousis, Hanan Ghamdi","doi":"10.1177/00131644221130482","DOIUrl":"10.1177/00131644221130482","url":null,"abstract":"<p><p>The purpose of the present study was to provide the means to evaluate the \"interval-scaling\" assumption that governs the use of parametric statistics and continuous data estimators in self-report instruments that utilize Likert-type scaling. Using simulated and real data, the methodology to test for this important assumption is evaluated using the popular software Mplus 8.8. Evidence on meeting the assumption is provided using the Wald test and the equidistant index. It is suggested that routine evaluations of self-report instruments engage the present methodology so that the most appropriate estimator will be implemented when testing the construct validity of self-report instruments.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"885-906"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470166/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10357822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Position of Correct Option and Distractors Impacts Responses to Multiple-Choice Items: Evidence From a National Test.","authors":"Séverin Lions, Pablo Dartnell, Gabriela Toledo, María Inés Godoy, Nora Córdova, Daniela Jiménez, Julie Lemarié","doi":"10.1177/00131644221132335","DOIUrl":"10.1177/00131644221132335","url":null,"abstract":"<p><p>Even though the impact of the position of response options on answers to multiple-choice items has been investigated for decades, it remains debated. Research on this topic is inconclusive, perhaps because too few studies have obtained experimental data from large-sized samples in a real-world context and have manipulated the position of both correct response and distractors. Since multiple-choice tests' outcomes can be strikingly consequential and option position effects constitute a potential source of measurement error, these effects should be clarified. In this study, two experiments in which the position of correct response and distractors was carefully manipulated were performed within a Chilean national high-stakes standardized test, responded by 195,715 examinees. Results show small but clear and systematic effects of options position on examinees' responses in both experiments. They consistently indicate that a five-option item is slightly easier when the correct response is in A rather than E and when the most attractive distractor is after and far away from the correct response. They clarify and extend previous findings, showing that the appeal of all options is influenced by position. The existence and nature of a potential interference phenomenon between the options' processing are discussed, and implications for test development are considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"861-884"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470158/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10306861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact and Detection of Uniform Differential Item Functioning for Continuous Item Response Models.","authors":"W Holmes Finch","doi":"10.1177/00131644221111993","DOIUrl":"10.1177/00131644221111993","url":null,"abstract":"<p><p>Psychometricians have devoted much research and attention to categorical item responses, leading to the development and widespread use of item response theory for the estimation of model parameters and identification of items that do not perform in the same way for examinees from different population subgroups (e.g., differential item functioning [DIF]). With the increasing use of computer-based measurement, use of items with a continuous response modality is becoming more common. Models for use with these items have been developed and refined in recent years, but less attention has been devoted to investigating DIF for these continuous response models (CRMs). Therefore, the purpose of this simulation study was to compare the performance of three potential methods for assessing DIF for CRMs, including regression, the MIMIC model, and factor invariance testing. Study results revealed that the MIMIC model provided a combination of Type I error control and relatively high power for detecting DIF. Implications of these findings are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"929-952"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470162/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Preknowledge Cheating via Innovative Measures: A Mixture Hierarchical Model for Jointly Modeling Item Responses, Response Times, and Visual Fixation Counts.","authors":"Kaiwen Man, Jeffrey R Harring","doi":"10.1177/00131644221136142","DOIUrl":"10.1177/00131644221136142","url":null,"abstract":"<p><p>Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1059-1080"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470163/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The NEAT Equating Via Chaining Random Forests in the Context of Small Sample Sizes: A Machine-Learning Method.","authors":"Zhehan Jiang, Yuting Han, Lingling Xu, Dexin Shi, Ren Liu, Jinying Ouyang, Fen Cai","doi":"10.1177/00131644221120899","DOIUrl":"10.1177/00131644221120899","url":null,"abstract":"<p><p>The part of responses that is absent in the nonequivalent groups with anchor test (NEAT) design can be managed to a planned missing scenario. In the context of small sample sizes, we present a machine learning (ML)-based imputation technique called chaining random forests (CRF) to perform equating tasks within the NEAT design. Specifically, seven CRF-based imputation equating methods are proposed based on different data augmentation methods. The equating performance of the proposed methods is examined through a simulation study. Five factors are considered: (a) test length (20, 30, 40, 50), (b) sample size per test form (50 versus 100), (c) ratio of common/anchor items (0.2 versus 0.3), and (d) equivalent versus nonequivalent groups taking the two forms (no mean difference versus a mean difference of 0.5), and (e) three different types of anchors (random, easy, and hard), resulting in 96 conditions. In addition, five traditional equating methods, (1) Tucker method; (2) Levine observed score method; (3) equipercentile equating method; (4) circle-arc method; and (5) concurrent calibration based on Rasch model, were also considered, plus seven CRF-based imputation equating methods for a total of 12 methods in this study. The findings suggest that benefiting from the advantages of ML techniques, CRF-based methods that incorporate the equating result of the Tucker method, such as IMP_total_Tucker, IMP_pair_Tucker, and IMP_Tucker_cirlce methods, can yield more robust and trustable estimates for the \"missingness\" in an equating task and therefore result in more accurate equated scores than other counterparts in short-length tests with small samples.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"984-1006"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470159/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10357823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Mantel-Haenszel Estimators for Simultaneous Differential Item Functioning Tests.","authors":"Ivy Liu, Thomas Suesse, Samuel Harvey, Peter Yongqi Gu, Daniel Fernández, John Randal","doi":"10.1177/00131644221128341","DOIUrl":"10.1177/00131644221128341","url":null,"abstract":"<p><p>The Mantel-Haenszel estimator is one of the most popular techniques for measuring differential item functioning (DIF). A generalization of this estimator is applied to the context of DIF to compare items by taking the covariance of odds ratio estimators between dependent items into account. Unlike the Item Response Theory, the method does not rely on the local item independence assumption which is likely to be violated when one item provides clues about the answer of another item. Furthermore, we use these (co)variance estimators to construct a hypothesis test to assess DIF for multiple items simultaneously. A simulation study is presented to assess the performance of several tests. Finally, the use of these DIF tests is illustrated via application to two real data sets.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1007-1032"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470165/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10506044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}