{"title":"Circumventing construct-irrelevant variance in international assessments using cognitive diagnostic modeling: A curriculum-sensitive measure","authors":"","doi":"10.1016/j.stueduc.2024.101393","DOIUrl":null,"url":null,"abstract":"<div><p>International large-scale assessments such as TIMSS administer achievement tests that are based on an analysis of national curricula to compare student achievement across countries. The organizations that coordinate these studies use Rasch or more generalized item response theory (IRT) models in which all test items are assumed to measure a single latent ability. The test responses are then used to estimate this ability, and the ability scores are used to compare countries.</p><p>A central but yet-to-be-contested assumption of this approach is that the achievement tests measure an unobserved unidimensional continuous variable that is comparable across countries. One threat to this assumption is the fact that countries and even regions or school tracks within countries have different curricula. When seeking to fairly compare countries, it seems legitimate to account for the fact that applicable curricula differ and that some students may not have been taught the full test content yet. When seeking to fairly compare countries, it seems imperative to account for the fact that national curricula differ and that some countries may not have taught the full test content yet. Nevertheless, existing IRT-based rankings ignore such differences.</p><p>The present study proposes a direct method to deal with differing curricula and create a fair ranking of educational quality between countries. The new method compares countries solely on test content that has already been taught; it uses information on whether students have mastered skills taught in class or not and does not consider contents that have not been taught yet. Mastery is assessed via the deterministic-input, noisy, “and” gate (DINA) model, an interpretable and tractable cognitive diagnostic model. To illustrate the new method, we use data from TIMSS 1995 and compare it to the IRT-based scores published in the international study report. We find a mismatch between the TIMSS test contents and national curricula in all countries. At the same time, we observe a high correlation between the scores based on the new method and the conventional IRT scores. This finding underscores the robustness of the performance measures reported in TIMSS despite existing differences across national curricula.</p></div>","PeriodicalId":47539,"journal":{"name":"Studies in Educational Evaluation","volume":null,"pages":null},"PeriodicalIF":2.6000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Studies in Educational Evaluation","FirstCategoryId":"95","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0191491X24000725","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0
Abstract
International large-scale assessments such as TIMSS administer achievement tests that are based on an analysis of national curricula to compare student achievement across countries. The organizations that coordinate these studies use Rasch or more generalized item response theory (IRT) models in which all test items are assumed to measure a single latent ability. The test responses are then used to estimate this ability, and the ability scores are used to compare countries.
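For readers less familiar with this scaling approach, the Rasch model referred to above can be stated in its standard textbook form; the notation below, with θ_i for the ability of student i and b_j for the difficulty of item j, is a generic statement of the model rather than the TIMSS scaling specification:

$$P(X_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}$$

More general IRT models add item discrimination and guessing parameters, but all of them summarize each student's responses through a single continuous latent ability θ_i.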
A central but yet-to-be-contested assumption of this approach is that the achievement tests measure an unobserved unidimensional continuous variable that is comparable across countries. One threat to this assumption is the fact that countries, and even regions or school tracks within countries, have different curricula. When seeking to fairly compare countries, it seems imperative to account for the fact that national curricula differ and that some countries may not have taught the full test content yet. Nevertheless, existing IRT-based rankings ignore such differences.
The present study proposes a direct method to deal with differing curricula and create a fair ranking of educational quality across countries. The new method compares countries solely on test content that has already been taught: it uses information on whether students have mastered the skills taught in class and disregards content that has not yet been taught. Mastery is assessed via the deterministic-input, noisy, “and” gate (DINA) model, an interpretable and tractable cognitive diagnostic model. To illustrate the new method, we use data from TIMSS 1995 and compare the resulting scores to the IRT-based scores published in the international study report. We find a mismatch between the TIMSS test contents and national curricula in all countries. At the same time, we observe a high correlation between the scores based on the new method and the conventional IRT scores. This finding underscores the robustness of the performance measures reported in TIMSS despite existing differences across national curricula.
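As a rough illustration of the measurement model involved, the sketch below implements the DINA item response function under common textbook assumptions; the function name, the toy Q-matrix, and the slip/guess values are hypothetical and do not come from the article or the TIMSS data.

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """Correct-response probabilities for one student under the DINA model.

    alpha : (K,) binary vector of skill mastery for the student
    q     : (J, K) binary Q-matrix; q[j, k] == 1 if item j requires skill k
    slip  : (J,) slip probabilities s_j
    guess : (J,) guessing probabilities g_j
    """
    # eta_j = 1 iff the student masters every skill required by item j
    eta = np.all(alpha >= q, axis=1)
    # P(correct) = 1 - s_j when all required skills are mastered, else g_j
    return np.where(eta, 1 - slip, guess)

# Hypothetical toy example: 2 skills, 3 items
alpha = np.array([1, 0])                 # masters skill 1 only
q = np.array([[1, 0],                    # item 1 requires skill 1
              [0, 1],                    # item 2 requires skill 2
              [1, 1]])                   # item 3 requires both skills
slip = np.array([0.1, 0.1, 0.2])
guess = np.array([0.2, 0.2, 0.1])
print(dina_prob(alpha, q, slip, guess))  # [0.9 0.2 0.1]
```

In the curriculum-sensitive comparison described above, one would additionally restrict attention to the skills a country's curriculum has already covered before summarizing mastery at the country level; that filtering step is specific to the article and is not shown in this sketch.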
Journal Introduction:
Studies in Educational Evaluation publishes original reports of evaluation studies. Four types of articles are published by the journal: (a) Empirical evaluation studies representing evaluation practice in educational systems around the world; (b) Theoretical reflections and empirical studies related to issues involved in the evaluation of educational programs, educational institutions, educational personnel and student assessment; (c) Articles summarizing the state-of-the-art concerning specific topics in evaluation in general or in a particular country or group of countries; (d) Book reviews and brief abstracts of evaluation studies.