The dire disregard of measurement invariance testing in psychological science.

Impact factor: 7.8 · Zone 1 (Psychology) · JCR Q1 (Psychology, Multidisciplinary)

Psychological Methods · Pub Date: 2025-10-01 · Epub Date: 2023-12-25 · Pages: 966-979 · DOI: 10.1037/met0000624

Esther Maassen, E Damiano D'Urso, Marcel A L M van Assen, Michèle B Nuijten, Kim De Roover, Jelte M Wicherts


Abstract

Self-report scales are widely used in psychology to compare means in latent constructs across groups, experimental conditions, or time points. However, for these comparisons to be meaningful and unbiased, the scales must demonstrate measurement invariance (MI) across compared time points or (experimental) groups. MI testing determines whether the latent constructs are measured equivalently across groups or time, which is essential for meaningful comparisons. We conducted a systematic review of 426 psychology articles with openly available data, to (a) examine common practices in conducting and reporting of MI testing, (b) assess whether we could reproduce the reported MI results, and (c) conduct MI tests for the comparisons that enabled sufficiently powerful MI testing. We identified 96 articles that contained a total of 929 comparisons. Results showed that only 4% of the 929 comparisons underwent MI testing, and the tests were generally poorly reported. None of the reported MI tests were reproducible, and only 26% of the 174 newly performed MI tests reached sufficient (scalar) invariance, with MI failing completely in 58% of tests. Exploratory analyses suggested that in nearly half of the comparisons where configural invariance was rejected, the number of factors differed between groups. These results indicate that MI tests are rarely conducted and poorly reported in psychological studies. We observed frequent violations of MI, suggesting that reported differences between (experimental) groups may not be solely attributed to group differences in the latent constructs. We offer recommendations aimed at improving reporting and computational reproducibility practices in psychology. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
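To illustrate why a violation of scalar invariance can bias a comparison of group means, the following is a minimal sketch and not the authors' analysis. It assumes a linear one-factor measurement model in which each item's expected score is an intercept plus a loading times the latent mean. All item parameters, the `GroupModel` class, and the `expected_sum_score` helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupModel:
    """One-factor linear measurement model for one group (hypothetical values)."""
    intercepts: List[float]  # item intercepts
    loadings: List[float]    # factor loadings
    latent_mean: float       # mean of the latent construct in this group


def expected_sum_score(model: GroupModel) -> float:
    """Expected sum score: sum over items of intercept + loading * latent mean."""
    return sum(
        intercept + loading * model.latent_mean
        for intercept, loading in zip(model.intercepts, model.loadings)
    )


# Both groups have the same latent mean, so the true construct difference is zero.
group_a = GroupModel([2.0, 2.0, 2.0, 2.0], [1.0, 0.8, 0.9, 0.7], 0.0)
# In group B the intercept of the fourth item is shifted by 0.5,
# which violates scalar (intercept) invariance.
group_b = GroupModel([2.0, 2.0, 2.0, 2.5], [1.0, 0.8, 0.9, 0.7], 0.0)

latent_difference = group_b.latent_mean - group_a.latent_mean
observed_difference = expected_sum_score(group_b) - expected_sum_score(group_a)

print(f"latent mean difference:   {latent_difference}")
print(f"observed score difference: {observed_difference}")
```

In this toy setup the observed sum scores differ between the groups even though the latent means are identical, which is the kind of bias that MI testing is meant to detect before group means are compared.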




Source journal: Psychological Methods (Psychology, Multidisciplinary)

CiteScore: 13.10
Self-citation rate: 7.10%
Articles published: 159

About the journal: Psychological Methods is devoted to the development and dissemination of methods for collecting, analyzing, understanding, and interpreting psychological data. Its purpose is the dissemination of innovations in research design, measurement, methodology, and quantitative and qualitative analysis to the psychological community; its further purpose is to promote effective communication about related substantive and methodological issues. The audience is expected to be diverse and to include those who develop new procedures, those who are responsible for undergraduate and graduate training in design, measurement, and statistics, as well as those who employ those procedures in research.