Still Interested in Multidimensional Item Response Theory Modeling? Here Are Some Thoughts on How to Make It Work in Practice

Terry A. Ackerman, Richard M. Luecht
{"title":"Still Interested in Multidimensional Item Response Theory Modeling? Here Are Some Thoughts on How to Make It Work in Practice","authors":"Terry A. Ackerman,&nbsp;Richard M. Luecht","doi":"10.1111/emip.12645","DOIUrl":null,"url":null,"abstract":"<p>Given tremendous improvements over the past three to four decades in the computational methods and computer technologies needed to estimate the parameters for higher dimensionality models (Cai, <span>2010a, 2010b</span>, <span>2017</span>), we might expect that MIRT would by now be a widely used array of models and psychometric software tools being used operationally in many educational assessment settings. Perhaps one of the few areas where MIRT has helped practitioners is in the area of understanding Differential Item Functioning (DIF) (Ackerman &amp; Ma, <span>2024</span>; Camilli, <span>1992</span>; Shealy &amp; Stout, <span>1993</span>). Nevertheless, the expectation has not been met nor do there seem to be many operational initiatives to change the <i>status quo</i>.</p><p>Some research psychometricians might lament the lack of large-scale applications of MIRT in the field of educational assessment. However, the simple fact is that MIRT has not lived up to its early expectations nor its potential due to several barriers. Following a discussion of test purpose and metric design issues in the next section, we will examine some of the barriers associated with these topics and provide suggestions for overcoming or completely avoiding them.</p><p>Tests developed for one purpose are rarely of much utility for another purpose. For example, professional certification and licensure tests designed to optimize pass-fail classifications are often not very useful for reporting scores across a large proficiency range—at least not unless the tests are extremely long. Summative, and most interim assessments used in K–12 education, are usually designed to produce reliable total-test scores. The resulting scale scores are summarized as descriptive statistical aggregations of scale scores or other functions of the scores such as classifying students in ordered achievement levels (e.g., Below Basic, Basic, Proficient, Advanced), or in modeling student growth in a subject area as part of an educational accountability system. Some commercially available online “interim” assessments provide limited progress-oriented scores and subscores from on-demand tests. However, the defensible formative utility of most interim assessments remains limited because test development and psychometric analytics follow the summative assessment test design and development paradigm: focusing on maintaining vertically aligned or equated, unidimensional scores scales (e.g., a K–12 math scale).</p><p>The requisite test design and development frameworks for summative tests focus on the relationships between the item responses and the total test score scale (e.g., maximizing item-total score correlations and the conditional reliability within prioritized regions of that score scale).</p><p>Applying MIRT models to most summative or interim assessments makes little sense. The problem is that we continue to allow policymakers to make claims about score interpretations that are not supported by the test or scale design. The standards on most K–12 tests are not multidimensional. Rather, they are a taxonomy of unordered statements—many of which cannot be measured by typical test items—that vaguely reflect the intended scope of an assessment. 
Some work is underway to provide “assessment standards” that reflect an ordered set of proficiency claims and associated evidence (measurement information) that changes in complexity. The reported scores may be viewed as composite measures representing two or more content domains or subdomains. But they still tend to function as a unidimensional scale. A unidimensional composite can be a mixture of multiple subdomains or content areas as long as the underlying, unitary trait can be empirically demonstrated to satisfy local independence under a particular IRT model.</p><p>Most MIRT studies applied to summative and interim tests are just exploratory factor analyses. That is, the models may help isolate minor amounts of nuanced multidimensionality and researchers may then attempt to interpret the patterns of residual covariance in some content-focused way. However, whenever we develop and select items with high item-total score correlations (e.g., point-biserial correlations), we build our tests to provide a singular measurement signal—an essentially unidimensional scale. We might pretend that we can legitimately organize items into content-based strands and report subscores. However, the subscore item groupings tend to be statistically unjustified and merely result in less reliable estimates of the (essentially) unidimensional trait supported by the data (Haberman &amp; Sinharay, <span>2010</span>). The point is that subscores—any subscore on an essentially unidimensional test—should NOT be computed nor reported. Developing reliable, valid, and useful subscore profiles demands a commitment to designing and maintaining <i>multiple</i> scales.</p><p>Instead, consider a different and perhaps more useful <i>formative</i> assessment purpose—at least potentially useful to teachers, parents, and importantly, the students. While some conceptualize formative assessments as low-stakes classroom assessments, the critical value of these assessments for improving instruction and changing student learning in a positive way cannot be minimized.</p><p>Well-designed formative assessments should arguably be based on <i>multiple</i> metrics that are demonstrably sensitive to good instruction, curricular design, and student learning. They further need to be offered on-demand—possibly daily—and provide immediate or at least timely, detailed, and pedagogically, actionable information to teachers. From a test design and development perspective, the implication is that formative assessments must provide useful and informative performance <i>profiles</i> for individual students that reliably identify valid student strengths on which to build and weaknesses to remediate, as well as simultaneously monitoring progress concerning multiple traits or competency-based metrics.</p><p>The central uses of formative assessments align well with the capabilities of MIRT modeling where the latter provides numerous technical psychometric tools for building and maintaining multiple score scales. [Note: this statement extends to diagnostic classification models (DCMs) where discrete, ordered traits or attributes replace the continuous proficiency metrics assumed for most MIRT models. See, for example, Sessoms and Henson (<span>2018</span>). One promising new approach for diagnostic DCM is by Stout et al. (<span>2023</span>).] The challenge is that adopting a MIRT model or DCM is not, in and of itself, a formative assessment solution. 
A different test design and development paradigm is needed.</p><p>At this juncture, it seems important to remind ourselves that we fit psychometric models to data—not the reverse. Therefore, the important issues in this article are <i>not</i> centered on which MIRT or DCM model to use or which statistical parameter estimators to employ. Those are psychometric calibration and scaling choices. The most important issues revolve around the characteristics of the data—starting with how we can efficiently design and create items, and then assemble test forms that can meet our intended formative assessment information needs. Our extensive experience with, and research-based knowledge from, large-scale summative testing may not apply to formative assessment systems. For example, we need to consider different mechanisms for evaluating item quality, calibrating the items, linking or equating scales, and scoring student performances.</p><p>If we can agree on the utility of formative assessments and the ensuing need for multiple constructs, the obvious question becomes, “<i>Which constructs should we measure</i>?” It is not sufficient to write test questions to nebulous content and/or cognitive specifications associated with content-based subdomains, factor analyze the results from large-scale field trials, and then play the “<i>Name that Factor</i>” game. Each scale needs to have a concrete purpose supported by its design properties and development priorities.</p><p>Figure 1 displays the four primary domains from the <i>Common Core State Standards (CCSS)</i> for Grade 2 Mathematics (NGA &amp; CCSSO, <span>2010</span>). Additional detail is provided at the <i>Clusters</i> and <i>Standards</i> levels for the <i>Measurement &amp; Data</i> domain.</p><p>Now contemplate this CCSS example from the perspective of a formative assessment design. At the very least, we would need four score scales, one for each domain (2.OA, 2.NBT, 2.MD, and 2.G). While likely to be positively correlated with one another, it seems implausible that most second-grade students would have the same educational opportunities to learn and almost identical levels of mathematics knowledge and skills—or even highly consistent patterns of performance across these four domain-based proficiencies. Often, dimensionality is correlated with the placement of students with learning. Before receiving instruction or long after mastering the material, data appear unidimensional. It is only when students are actively challenged by learning that dimensions appear. A well-designed formative assessment system might <i>expect</i> to observe different score patterns to emerge across the four domains reflecting different patterns of student strengths and weaknesses concerning the domain-specific knowledge and skills measured.</p><p>Consider Figure 2. The left side of the figure depicts the intended structure. That is, the four ellipses are the constructs with the curved connectors denoting nonzero covariances among the scales. The middle image shows the potential magnitude of the six covariances between the traits—which would be proportional to the cosines of the angles between each domain-based scale. From a measurement perspective, each trait is a factor or reference composite (Luecht &amp; Miller, <span>1992</span>; Wang, <span>1986</span>) that psychometrically functions as a unique scale. Finally, the right side shows the score profiles for three students. 
This figure outlines a high-level target formative assessment scale design!</p><p>Figure 3 explicitly shows more detail about how our test and scale design goals are substantially different for unidimensional (summative or interim) and formative assessments. Under a unidimensional design paradigm (left side of Figure 3) all items within each of the included content domains are expected to be highly correlated with one another (+++) and with all items in other domains. Conversely, for a formative test designed to measure four rather distinct domains (right side of Figure 2). The items would correlate well within a domain but not as highly across domains.</p><p>These correlational patterns are intentional design goals. They are not accidental nor are they likely to be “discovered” via creative factor analyses. We need to build our items and the scales synchronously so that statistical structural modeling confirms the intended scale structures and properties of the test items. If we subsequently find that our target structures are not being met, we will likely need to impose serious item design constraints on the cognitive complexity of the items within each domain to deemphasize and isolate the intended signals for our constructs.</p><p>The nature of the (intended) multidimensionality underlying may also change over time as shown in Figure 4. This figure provides an example of a high-level scale-design requirement that might support a domain-specific formative assessment system to be used during a single semester or entire academic year. Some foundational knowledge and skill domains may become more prominent early in the semester and disappear or be absorbed later on. See Luecht (<span>2003</span>) for a substantive example involving the changing dimensionality of oral language proficiency. Figure 4 shows four <i>epochs</i> (e.g., time periods that could correspond to different learning modules or the scope and sequence of learning objectives). There are also five measured traits or constructs (<i>θ</i><sub>1</sub>, <i>θ</i><sub>2</sub>,…,<i>θ</i><sub>5</sub>). The structural diagrams near the top of each rectangle denote the constructs and their inter-relationships (i.e., covariances). The double-ended arrows denote the nature of those covariances where smaller angles correspond with higher covariances or correlations, proportional to the cosines of the angles.</p><p>Figure 4 suggests that we need two scales for Epoch #1 <i>θ</i><sub>1</sub> and <i>θ</i><sub>2</sub>. We transition to three scales to profile student strengths and weaknesses at Epoch #2, where <i>θ</i><sub>1</sub> and <i>θ</i><sub>2</sub> become more highly correlated and a third scale, <i>θ</i><sub>3</sub>, is needed. (Note that the rationale for the increased magnitude of the correlation between <i>θ</i><sub>1</sub> and <i>θ</i><sub>2</sub> as we move from Epoch #1 to #2 is, as students learn and simultaneously master two or more traits, we would expect the inter-trait correlations to increase toward a unidimensional composite.) At Epoch #3, <i>θ</i><sub>1</sub> drops out of the intended multidimensional score profile and <i>θ</i><sub>4</sub> emerges—with <i>θ</i><sub>2</sub> and <i>θ</i><sub>3</sub> now becoming more correlated with one another. 
Finally, by Epoch #4, <i>θ</i><sub>1</sub>, <i>θ</i><sub>2</sub>, and <i>θ</i><sub>3</sub> coalesce, and a fifth trait, <i>θ</i><sub>5</sub>, is added to the measurement profile.</p><p>The type of dynamic score-scale profile design implied by Figure 4 is unlikely to happen by starting with an item bank constructed to primarily support a unidimensional score scale (e.g., a summative test item bank or a simply a bank where items are screened have the highest possible item-total test score correlations). Building and maintaining multiple useful scales to support robust score profiles starts with an in-depth understanding of the nature of each construct at a very detailed level. We are essentially building multiple test batteries (e.g., a battery for each epoch shown in Figure 4). This is not a trivial undertaking but, as discussed further on, is feasible using a <i>principled</i> scale and item design framework.</p><p>The test design and development challenges for building and maintaining multidimensional score scales to support formative assessment are not trivial. There can be substantial up-front costs to engineer a robust infrastructure that can: (a) generate potentially massive item banks: (b) support the intended multidimensional score scale structure(s); (c) significantly reduce or eliminate pilot-testing and calibration of individual items; and (d) support optimal within-domain test assembly constraints and forms construction for on-demand testing, possible with multiple epochs (Figure 4). The longer-term payoffs emerge from integrating multidimensional modeling efforts with a robust architecture for generating isomorphic items to detailed task model specifications.</p><p>Following the Assessment Engineering framework, Luecht and Burke (<span>2020</span>) proposed a test design and development paradigm that concentrates scale design on manipulable properties of the tasks (also see Luecht et al., <span>2010</span>; Luecht, 2012a, 2012b, <span>2013, 2016</span>). Assessment Engineering advocates for laying out in a detailed fashion the ordered proficiency claims for each scale. For example, foundational claims about knowledge and skills are superseded by incrementally more rigorous claims. The core Assessment Engineering item and test development technology is called <i>task modeling</i>.</p><p>Task modeling focuses on engineering two critical aspects of instrument design. First, it incorporates item-difficulty modeling research to establish empirically verified complexity design layers that control the cognitive complexity and difficulty of entire families of items called <i>task-model families</i>. Note that discrimination is NOT something that we care about directly with item difficulty modeling. By controlling cognitive complexity via the complexity design layers, we can statistically <i>locate</i> each task-model family on the underlying scale and further use them as quality control indicators for new items as they are generated for each family. Items within each task-model family are treated as statistically and substantively <i>isomorphic</i> (i.e., exchangeable use for scoring and score interpretation purposes as evidence of specific proficiency claims). As discussed further on, this isomorphic property—subject to empirical verification—has enormous benefits for using “data-hungry” MIRT models as part of an operational formative assessment system. The second aspect involves developing a task model map for each scale. 
Task model maps replace more traditional content blueprints with a highly detail test assembly specification that simultaneously represents the test-form content and the statistical measurement information target(s).</p><p>Figure 5 depicts the alignment between items, task models, and complexity design layers that control the difficulties (locations) of the task model families. Task complexity is increased (left to right) along each class of complexity design layer metrics. Each task model family is represented by one or more item models that can generate multiple, isomorphic instantiations. The task model map represents the comprehensive content, cognitive, and statistical blueprint for building parallel test forms on a single scale.</p><p>So how does Assessment Engineering task modeling address the challenges of formative assessment and MIRT modeling? There is a three-part answer to that question. First, we need to recognize that each scale is an independent progression of incrementally more complex task challenges. We need to map the proficiency claims for each scale as a unique construct independent of all other constructs. Assessment Engineering construct mapping details the progression of <i>domain-specific</i> proficiency-based claims and expected evidence to support those claims. Second, we develop task model maps that provide the collection of incrementally more complex task challenges along each domain-specific scale. Different scales are likely to represent different types of complexity design layers by design. Third, we do not complicate test design by adopting complicated MIRT models that we expect to absorb nuanced misfit within or across task model families.</p><p>There are three ways to generate the item families for each task model family: (1) using human item writers; (2) building or licensing automatic item generation software and customizing it for each domain of interest; or (3) employing large language models and generative artificial intelligence to produce the items.</p><p>Using human item writers can prove to be cost-effective when relatively small numbers of items per task model family are needed (e.g., fewer than 50 items per family). The item writers should use tightly controlled templates that limit their creativity in generating the variant items. In addition, active quality control procedures can be implemented that incorporate natural language processing tools and CDL-based feedback to help the item writers refine each family.</p><p>Automated item generation is implemented using a dedicated software application. High-quality parent items within each domain are <i>parameterized</i> by having subject-matter experts note particular segments (words, phrases, variables, or values) that can take on multiple values, subject to plausibility constraints (Embretson, <span>1999</span>; Embretson &amp; Kingston, <span>2018</span>; Gierl &amp; Haladyna, <span>2012</span>; Gierl &amp; Lai, <span>2012</span>; Gierl et al., <span>2012</span>). Natural language processing syntax and plausibility checking can also be implemented to screen out some of the generated items. AIG can produce hundreds or thousands of items from a given parent item.</p><p>Finally, the use of large language models and generative artificial intelligence is still relatively new for operational test development. 
There has been some success in language testing using large language models based on transformer architectures like OpenAI's <i>ChatGPT</i>™ and <i>Google's</i> BERT; however, most of these applications still rely on human review and refinement of the generated test content. A more in-depth discussion of these technologies and their relative merits is beyond the scope of this article.</p><p>The key point is that we must focus on the intended properties of the task model families. We can then replicate our procedures for the task models mapped to the other formative scales we are attempting to build (and maintain).</p><p>It is entirely possible to treat items as isomorphic (randomly exchangeable) within task model families and use a common-siblings calibration strategy where the response data are collapsed for each family (Shu et al., <span>2010</span>; Sinharay &amp; Johnson, 2003, <span>2013</span>). Of course, the assumption of isomorphism of the within-task model families item operating characteristics needs to be empirically verified to an adequate level of <i>tolerable</i> variance (e.g., Luecht, <span>2024</span>; Someshwar, <span>2024</span>). A further simplification would be to treat each formative trait as a separate scale for calibration purposes.</p><p>Simple structure can be imposed such that individual items and families only impact a single score scale (i.e., the item discrimination vector, <b><i>a</i></b><i><sub>f</sub></i>, has only one nonzero element corresponding to the domain two to which it belongs). There are practical calibration and scale maintenance benefits associated with isolating the item parameters for each domain using a simple structure paradigm. That is, under that paradigm, each item family effectively only has one discrimination corresponding to the intended dimension.</p><p>Estimating the hybrid model uses a single response vector for each family. The item-within-family difficulties are estimated from that same (collapsed) response vector. However, the estimation also requires a separate indexing variable for the item instances within a family. Only that subvector of responses is used to estimate each item's difficulty per the model.</p><p>It should be noted that there are alternatives to the proposed model in the paper that could be adapted to a MIRT context. These include the explanatory IRT models (EIRTM) proposed by De Boeck and Wilson (<span>2004</span>) and De Boeck et al. (<span>2011</span>), or extensions of Fischer's linear logistic test model (Fischer, <span>1995</span>) or Embretson's general multicomponent latent trait model (Embretson, <span>1984</span>).  It is important to emphasize that the model of choice—whatever it is—has to be able to deal with the notion of “item families,” rather than individual items—see Sinharay and Johnson (<span>2013</span>) and Geerlings et al. (<span>2011</span>). Item families are created by following a design and content generation process that produces consistent “principled” multidimensional information. It should further be noted that if properly implemented, we may not need overly complex models for operational formative testing.</p><p>We probably shouldn't rely on EIRTMs to “discover” the complexity dimensions that should go into task and item design. Good design is intentional—a tenet of Assessment Engineering and a core principle of industrial engineering. The corollary is that our psychometric models should confirm the intended scale design in as simple and straightforward a manner as possible.  
EIRTMs typically incorporate person factors (demographic variables or covariates) that “explain” differential performance (person-by-item interactions). Additionally, one could apply EIRTMs to task model families instead of individual items.</p><p>MIRT modeling has largely been limited to simulation studies or empirical exploratory factor analytic studies with essentially unidimensional data and score scales. This article makes the somewhat bold assertation that the benefits of MIRT modeling can only be realized if we match its capabilities to a test purpose with supporting test design and development activities that require multiple score scales. Formative assessment may be that ideal application.</p><p>However, useful formative assessments come with some costs, figuratively and literally speaking. For example, it is not inconceivable that hundreds of isomorphic items might be needed for each task model family within each domain to support on-demand testing. That is, formative assessments should be on-demand and provide timely profiles of multiple, instructionally sensitive score scales that respond to good instruction, curricular design, and student learning. Getting the needed multidimensional measurement information is where principled frameworks like Assessment Engineering can help in terms of design, item production, efficient calibration of item families, and overall quality assurance for the banks of items.</p><p>By mapping out each construct (scale) and the cognitive complexity of task models needed to inform progress along them, we also build strong validity arguments into the scale design itself. By creating and then calibrating task model families rather than the individual items, we solve some of the item production complications needed to sustain on-demand testing in a formative assessment context. Well-designed task model families can further bypass pilot testing and the need to calibrate individual items.</p><p>The fact that we have to contend with multiple scales and MIRT modeling is not a complication under Assessment Engineering. We merely construct each scale from the ground up and create a synchronized psychometric and test development infrastructure that emphasizes scalable production of items and test forms with intentional ongoing quality control. Over time, through design modifications, our efforts to eliminate all but minor degrees of variation in the statistical operating characteristics within task model families imply and help ensure highly robust and formatively useful scales.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"93-100"},"PeriodicalIF":2.7000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12645","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12645","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0

Abstract

Given tremendous improvements over the past three to four decades in the computational methods and computer technologies needed to estimate the parameters for higher-dimensionality models (Cai, 2010a, 2010b, 2017), we might expect that multidimensional item response theory (MIRT) would by now be a widely used array of models and psychometric software tools deployed operationally in many educational assessment settings. Perhaps one of the few areas where MIRT has helped practitioners is in understanding Differential Item Functioning (DIF) (Ackerman & Ma, 2024; Camilli, 1992; Shealy & Stout, 1993). Nevertheless, that expectation has not been met, nor do there seem to be many operational initiatives to change the status quo.

Some research psychometricians might lament the lack of large-scale applications of MIRT in the field of educational assessment. However, the simple fact is that MIRT has not lived up to its early expectations nor its potential due to several barriers. Following a discussion of test purpose and metric design issues in the next section, we will examine some of the barriers associated with these topics and provide suggestions for overcoming or completely avoiding them.

Tests developed for one purpose are rarely of much utility for another purpose. For example, professional certification and licensure tests designed to optimize pass-fail classifications are often not very useful for reporting scores across a large proficiency range—at least not unless the tests are extremely long. Summative assessments, and most interim assessments used in K–12 education, are usually designed to produce reliable total-test scores. The resulting scale scores are summarized as descriptive statistical aggregations, or as other functions of the scores such as classifying students into ordered achievement levels (e.g., Below Basic, Basic, Proficient, Advanced) or modeling student growth in a subject area as part of an educational accountability system. Some commercially available online “interim” assessments provide limited progress-oriented scores and subscores from on-demand tests. However, the defensible formative utility of most interim assessments remains limited because test development and psychometric analytics follow the summative assessment test design and development paradigm: focusing on maintaining vertically aligned or equated, unidimensional score scales (e.g., a K–12 math scale).

The requisite test design and development frameworks for summative tests focus on the relationships between the item responses and the total test score scale (e.g., maximizing item-total score correlations and the conditional reliability within prioritized regions of that score scale).
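
As a concrete illustration of that design target, the short sketch below computes corrected item-total (point-biserial) correlations from a simulated response matrix. The data, seed, and function name are hypothetical; the code only shows the screening statistic that summative item selection typically maximizes.

```python
import numpy as np

def corrected_item_total_correlations(responses: np.ndarray) -> np.ndarray:
    """Point-biserial correlation of each item with its rest-score
    (the total score excluding that item), a common summative screening statistic."""
    n_items = responses.shape[1]
    total = responses.sum(axis=1)
    r = np.empty(n_items)
    for i in range(n_items):
        rest = total - responses[:, i]  # rest-score avoids part-whole inflation
        r[i] = np.corrcoef(responses[:, i], rest)[0, 1]
    return r

# Hypothetical 0/1 response matrix: 500 examinees by 40 items from a Rasch-type model
rng = np.random.default_rng(1)
theta = rng.normal(size=(500, 1))
b = rng.normal(size=(1, 40))
p = 1.0 / (1.0 + np.exp(-(theta - b)))
X = (rng.uniform(size=p.shape) < p).astype(int)
print(np.round(corrected_item_total_correlations(X), 2))
```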

Applying MIRT models to most summative or interim assessments makes little sense. The problem is that we continue to allow policymakers to make claims about score interpretations that are not supported by the test or scale design. The standards on most K–12 tests are not multidimensional. Rather, they are a taxonomy of unordered statements—many of which cannot be measured by typical test items—that vaguely reflect the intended scope of an assessment. Some work is underway to provide “assessment standards” that reflect an ordered set of proficiency claims and associated evidence (measurement information) that changes in complexity. The reported scores may be viewed as composite measures representing two or more content domains or subdomains. But they still tend to function as a unidimensional scale. A unidimensional composite can be a mixture of multiple subdomains or content areas as long as the underlying, unitary trait can be empirically demonstrated to satisfy local independence under a particular IRT model.

Most MIRT studies applied to summative and interim tests are just exploratory factor analyses. That is, the models may help isolate minor amounts of nuanced multidimensionality, and researchers may then attempt to interpret the patterns of residual covariance in some content-focused way. However, whenever we develop and select items with high item-total score correlations (e.g., point-biserial correlations), we build our tests to provide a singular measurement signal—an essentially unidimensional scale. We might pretend that we can legitimately organize items into content-based strands and report subscores. However, the subscore item groupings tend to be statistically unjustified and merely result in less reliable estimates of the (essentially) unidimensional trait supported by the data (Haberman & Sinharay, 2010). The point is that subscores—any subscore on an essentially unidimensional test—should NOT be computed or reported. Developing reliable, valid, and useful subscore profiles demands a commitment to designing and maintaining multiple scales.
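
The toy simulation below illustrates why such subscores add little when the data are essentially unidimensional: two arbitrary "content strand" subscores are each less reliable than the total score, and their disattenuated correlation approaches 1.0, suggesting they measure the same trait. This is only a demonstration in the spirit of the Haberman and Sinharay (2010) argument, not their method; all values are simulated.

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Coefficient alpha as a rough reliability index for a sum score."""
    k = X.shape[1]
    return k / (k - 1) * (1.0 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

# Simulate an essentially unidimensional 40-item test for 2,000 examinees
rng = np.random.default_rng(7)
theta = rng.normal(size=(2000, 1))
b = rng.normal(size=(1, 40))
X = (rng.uniform(size=(2000, 40)) < 1.0 / (1.0 + np.exp(-(theta - b)))).astype(int)

strand_a, strand_b = X[:, :20], X[:, 20:]            # arbitrary "content strands"
sub_a, sub_b = strand_a.sum(axis=1), strand_b.sum(axis=1)

alpha_a, alpha_b = cronbach_alpha(strand_a), cronbach_alpha(strand_b)
r_ab = np.corrcoef(sub_a, sub_b)[0, 1]
print("alpha (total, strand A, strand B):",
      round(cronbach_alpha(X), 2), round(alpha_a, 2), round(alpha_b, 2))
print("disattenuated strand correlation:", round(r_ab / np.sqrt(alpha_a * alpha_b), 2))
```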

Instead, consider a different and perhaps more useful formative assessment purpose—at least potentially useful to teachers, parents, and importantly, the students. While some conceptualize formative assessments as low-stakes classroom assessments, the critical value of these assessments for improving instruction and positively changing student learning should not be minimized.

Well-designed formative assessments should arguably be based on multiple metrics that are demonstrably sensitive to good instruction, curricular design, and student learning. They further need to be offered on-demand—possibly daily—and provide immediate, or at least timely, detailed, and pedagogically actionable information to teachers. From a test design and development perspective, the implication is that formative assessments must provide useful and informative performance profiles for individual students that reliably identify valid student strengths on which to build and weaknesses to remediate, while simultaneously monitoring progress on multiple traits or competency-based metrics.

The central uses of formative assessments align well with the capabilities of MIRT modeling where the latter provides numerous technical psychometric tools for building and maintaining multiple score scales. [Note: this statement extends to diagnostic classification models (DCMs) where discrete, ordered traits or attributes replace the continuous proficiency metrics assumed for most MIRT models. See, for example, Sessoms and Henson (2018). One promising new approach for diagnostic DCM is by Stout et al. (2023).] The challenge is that adopting a MIRT model or DCM is not, in and of itself, a formative assessment solution. A different test design and development paradigm is needed.

At this juncture, it seems important to remind ourselves that we fit psychometric models to data—not the reverse. Therefore, the important issues in this article are not centered on which MIRT or DCM model to use or which statistical parameter estimators to employ. Those are psychometric calibration and scaling choices. The most important issues revolve around the characteristics of the data—starting with how we can efficiently design and create items, and then assemble test forms that can meet our intended formative assessment information needs. Our extensive experience with, and research-based knowledge from, large-scale summative testing may not apply to formative assessment systems. For example, we need to consider different mechanisms for evaluating item quality, calibrating the items, linking or equating scales, and scoring student performances.

If we can agree on the utility of formative assessments and the ensuing need for multiple constructs, the obvious question becomes, “Which constructs should we measure?” It is not sufficient to write test questions to nebulous content and/or cognitive specifications associated with content-based subdomains, factor analyze the results from large-scale field trials, and then play the “Name that Factor” game. Each scale needs to have a concrete purpose supported by its design properties and development priorities.

Figure 1 displays the four primary domains from the Common Core State Standards (CCSS) for Grade 2 Mathematics (NGA & CCSSO, 2010). Additional detail is provided at the Clusters and Standards levels for the Measurement & Data domain.

Now contemplate this CCSS example from the perspective of a formative assessment design. At the very least, we would need four score scales, one for each domain (2.OA, 2.NBT, 2.MD, and 2.G). While the domains are likely to be positively correlated with one another, it seems implausible that most second-grade students would have the same educational opportunities to learn and almost identical levels of mathematics knowledge and skills—or even highly consistent patterns of performance across these four domain-based proficiencies. Dimensionality is also often related to where students are in the learning process: before receiving instruction, or long after mastering the material, the data appear unidimensional; it is only while students are actively challenged by learning that distinct dimensions emerge. A well-designed formative assessment system might therefore expect different score patterns to emerge across the four domains, reflecting different patterns of student strengths and weaknesses on the domain-specific knowledge and skills measured.

Consider Figure 2. The left side of the figure depicts the intended structure. That is, the four ellipses are the constructs, with the curved connectors denoting nonzero covariances among the scales. The middle image shows the potential magnitude of the six covariances between the traits, which would be proportional to the cosines of the angles between each pair of domain-based scales. From a measurement perspective, each trait is a factor or reference composite (Luecht & Miller, 1992; Wang, 1986) that psychometrically functions as a unique scale. Finally, the right side shows the score profiles for three students. This figure outlines a high-level target formative assessment scale design!
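
To make the geometric interpretation concrete, the snippet below converts a hypothetical correlation matrix among the four domain traits into the angles pictured in Figure 2 (the correlation between two scales is the cosine of the angle between them). The correlation values are purely illustrative.

```python
import numpy as np

# Hypothetical correlations among the four domain traits (2.OA, 2.NBT, 2.MD, 2.G);
# these values are illustrative, not estimates from any real data set.
R = np.array([
    [1.00, 0.75, 0.60, 0.45],
    [0.75, 1.00, 0.65, 0.50],
    [0.60, 0.65, 1.00, 0.55],
    [0.45, 0.50, 0.55, 1.00],
])

# The angle between two scales is the arccosine of their correlation;
# e.g., r = 0.75 corresponds to an angle of roughly 41 degrees.
angles_deg = np.degrees(np.arccos(np.clip(R, -1.0, 1.0)))
print(np.round(angles_deg, 1))
```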

Figure 3 explicitly shows more detail about how our test and scale design goals are substantially different for unidimensional (summative or interim) and formative assessments. Under a unidimensional design paradigm (left side of Figure 3), all items within each of the included content domains are expected to be highly correlated with one another (+++) and with all items in other domains. Conversely, for a formative test designed to measure four rather distinct domains (right side of Figure 3), the items would correlate well within a domain but not as highly across domains.

These correlational patterns are intentional design goals. They are not accidental nor are they likely to be “discovered” via creative factor analyses. We need to build our items and the scales synchronously so that statistical structural modeling confirms the intended scale structures and properties of the test items. If we subsequently find that our target structures are not being met, we will likely need to impose serious item design constraints on the cognitive complexity of the items within each domain to deemphasize and isolate the intended signals for our constructs.

The nature of the (intended) underlying multidimensionality may also change over time, as shown in Figure 4. This figure provides an example of a high-level scale-design requirement that might support a domain-specific formative assessment system to be used during a single semester or an entire academic year. Some foundational knowledge and skill domains may be more prominent early in the semester and disappear or be absorbed later on. See Luecht (2003) for a substantive example involving the changing dimensionality of oral language proficiency. Figure 4 shows four epochs (e.g., time periods that could correspond to different learning modules or the scope and sequence of learning objectives). There are also five measured traits or constructs (θ1, θ2,…,θ5). The structural diagrams near the top of each rectangle denote the constructs and their inter-relationships (i.e., covariances). The double-ended arrows denote the nature of those covariances, where smaller angles correspond to higher covariances or correlations, proportional to the cosines of the angles.

Figure 4 suggests that we need two scales for Epoch #1: θ1 and θ2. We transition to three scales to profile student strengths and weaknesses at Epoch #2, where θ1 and θ2 become more highly correlated and a third scale, θ3, is needed. (The rationale for the increased correlation between θ1 and θ2 as we move from Epoch #1 to #2 is that, as students learn and simultaneously master two or more traits, we would expect the inter-trait correlations to increase toward a unidimensional composite.) At Epoch #3, θ1 drops out of the intended multidimensional score profile and θ4 emerges—with θ2 and θ3 now becoming more correlated with one another. Finally, by Epoch #4, θ1, θ2, and θ3 coalesce, and a fifth trait, θ5, is added to the measurement profile.
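
One way to treat such a dynamic design as an explicit engineering artifact is to encode the epoch-by-epoch scale structure as a configuration and verify that each target correlation structure is at least a valid (positive definite) correlation matrix. The sketch below does only that; the trait labels and correlation values are hypothetical stand-ins for the structures shown in Figure 4.

```python
import numpy as np

# Hypothetical encoding of the epoch-specific scale structures sketched in Figure 4:
# which traits are reported in each epoch and illustrative inter-trait correlations.
EPOCHS = {
    1: (["th1", "th2"], {("th1", "th2"): 0.50}),
    2: (["th1", "th2", "th3"], {("th1", "th2"): 0.80, ("th1", "th3"): 0.40, ("th2", "th3"): 0.45}),
    3: (["th2", "th3", "th4"], {("th2", "th3"): 0.75, ("th2", "th4"): 0.35, ("th3", "th4"): 0.40}),
    4: (["th123", "th5"], {("th123", "th5"): 0.30}),   # th1-th3 have coalesced
}

def corr_matrix(traits, corrs):
    """Assemble the target correlation matrix for one epoch."""
    idx = {t: i for i, t in enumerate(traits)}
    R = np.eye(len(traits))
    for (a, b), r in corrs.items():
        R[idx[a], idx[b]] = R[idx[b], idx[a]] = r
    return R

for epoch, (traits, corrs) in EPOCHS.items():
    R = corr_matrix(traits, corrs)
    ok = bool(np.all(np.linalg.eigvalsh(R) > 0))   # sanity check: valid correlation structure
    print(f"Epoch #{epoch}: scales {traits}, positive definite: {ok}")
```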

The type of dynamic score-scale profile design implied by Figure 4 is unlikely to happen by starting with an item bank constructed primarily to support a unidimensional score scale (e.g., a summative test item bank, or simply a bank where items are screened to have the highest possible item-total test score correlations). Building and maintaining multiple useful scales to support robust score profiles starts with an in-depth understanding of the nature of each construct at a very detailed level. We are essentially building multiple test batteries (e.g., a battery for each epoch shown in Figure 4). This is not a trivial undertaking but, as discussed further on, it is feasible using a principled scale and item design framework.

The test design and development challenges for building and maintaining multidimensional score scales to support formative assessment are not trivial. There can be substantial up-front costs to engineer a robust infrastructure that can: (a) generate potentially massive item banks; (b) support the intended multidimensional score scale structure(s); (c) significantly reduce or eliminate pilot-testing and calibration of individual items; and (d) support optimal within-domain test assembly constraints and forms construction for on-demand testing, possibly with multiple epochs (Figure 4). The longer-term payoffs emerge from integrating multidimensional modeling efforts with a robust architecture for generating isomorphic items to detailed task model specifications.

Following the Assessment Engineering framework, Luecht and Burke (2020) proposed a test design and development paradigm that concentrates scale design on manipulable properties of the tasks (also see Luecht et al., 2010; Luecht, 2012a, 2012b, 2013, 2016). Assessment Engineering advocates for laying out in a detailed fashion the ordered proficiency claims for each scale. For example, foundational claims about knowledge and skills are superseded by incrementally more rigorous claims. The core Assessment Engineering item and test development technology is called task modeling.

Task modeling focuses on engineering two critical aspects of instrument design. First, it incorporates item-difficulty modeling research to establish empirically verified complexity design layers (CDLs) that control the cognitive complexity and difficulty of entire families of items called task-model families. Note that discrimination is NOT something that we care about directly with item difficulty modeling. By controlling cognitive complexity via the complexity design layers, we can statistically locate each task-model family on the underlying scale and further use those locations as quality control indicators for new items as they are generated for each family. Items within each task-model family are treated as statistically and substantively isomorphic (i.e., exchangeable for scoring and score interpretation purposes as evidence of specific proficiency claims). As discussed further on, this isomorphic property—subject to empirical verification—has enormous benefits for using “data-hungry” MIRT models as part of an operational formative assessment system. The second aspect involves developing a task model map for each scale. Task model maps replace more traditional content blueprints with a highly detailed test assembly specification that simultaneously represents the test-form content and the statistical measurement information target(s).
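
The item-difficulty modeling step can be pictured as a small regression of calibrated task-model-family locations onto coded complexity design layer features, in the spirit of Fischer's LLTM. The sketch below uses made-up CDL codings and family locations purely to show the mechanics, not any operational calibration.

```python
import numpy as np

# Hypothetical CDL coding for six task-model families; columns might represent,
# e.g., number of solution steps, vocabulary load, and amount of scaffolding removed.
Q = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [2, 1, 0],
    [2, 1, 1],
    [3, 2, 1],
    [3, 2, 2],
], dtype=float)
b_family = np.array([-1.4, -0.9, -0.2, 0.3, 1.0, 1.6])  # illustrative family locations

# LLTM-flavored check: do the CDL features reproduce the family locations?
X = np.column_stack([np.ones(len(Q)), Q])                # intercept + CDL features
eta, *_ = np.linalg.lstsq(X, b_family, rcond=None)
print("estimated CDL effects:", np.round(eta, 2))
print("predicted locations  :", np.round(X @ eta, 2))
```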

Figure 5 depicts the alignment between items, task models, and complexity design layers that control the difficulties (locations) of the task model families. Task complexity is increased (left to right) along each class of complexity design layer metrics. Each task model family is represented by one or more item models that can generate multiple, isomorphic instantiations. The task model map represents the comprehensive content, cognitive, and statistical blueprint for building parallel test forms on a single scale.

So how does Assessment Engineering task modeling address the challenges of formative assessment and MIRT modeling? There is a three-part answer to that question. First, we need to recognize that each scale is an independent progression of incrementally more complex task challenges. We need to map the proficiency claims for each scale as a unique construct independent of all other constructs. Assessment Engineering construct mapping details the progression of domain-specific proficiency-based claims and expected evidence to support those claims. Second, we develop task model maps that provide the collection of incrementally more complex task challenges along each domain-specific scale. Different scales are likely to represent different types of complexity design layers by design. Third, we do not complicate test design by adopting complicated MIRT models that we expect to absorb nuanced misfit within or across task model families.

There are three ways to generate the item families for each task model family: (1) using human item writers; (2) building or licensing automatic item generation software and customizing it for each domain of interest; or (3) employing large language models and generative artificial intelligence to produce the items.

Using human item writers can prove to be cost-effective when relatively small numbers of items per task model family are needed (e.g., fewer than 50 items per family). The item writers should use tightly controlled templates that limit their creativity in generating the variant items. In addition, active quality control procedures can be implemented that incorporate natural language processing tools and CDL-based feedback to help the item writers refine each family.

Automated item generation (AIG) is implemented using a dedicated software application. High-quality parent items within each domain are parameterized by having subject-matter experts note particular segments (words, phrases, variables, or values) that can take on multiple values, subject to plausibility constraints (Embretson, 1999; Embretson & Kingston, 2018; Gierl & Haladyna, 2012; Gierl & Lai, 2012; Gierl et al., 2012). Natural language processing syntax and plausibility checking can also be implemented to screen out some of the generated items. AIG can produce hundreds or thousands of items from a given parent item.
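
A minimal template-based AIG sketch is shown below: one hypothetical parameterized parent item for a Grade 2 measurement task, with segments that can take multiple values subject to simple plausibility constraints. The stem, value ranges, and constraints are invented for illustration and do not come from the article or any operational generator.

```python
import itertools

# Hypothetical parameterized parent item for a Grade 2 measurement (2.MD) task.
STEM = "A ribbon is {length} {unit} long. You cut off {cut} {unit}. How much ribbon is left?"

lengths = range(20, 100, 5)
cuts = range(5, 60, 5)
units = ["centimeters", "inches"]

def plausible(length: int, cut: int) -> bool:
    """Plausibility constraints: the cut must be shorter than the ribbon,
    and the remainder should stay in a grade-appropriate range."""
    return cut < length and 5 <= length - cut <= 80

variants = [
    {"stem": STEM.format(length=L, unit=u, cut=c), "key": L - c}
    for L, c, u in itertools.product(lengths, cuts, units)
    if plausible(L, c)
]
print(len(variants), "isomorphic variants generated from one parent item")
print(variants[0]["stem"], "->", variants[0]["key"])
```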

Finally, the use of large language models and generative artificial intelligence is still relatively new for operational test development. There has been some success in language testing using large language models based on transformer architectures like OpenAI's ChatGPT™ and Google's BERT; however, most of these applications still rely on human review and refinement of the generated test content. A more in-depth discussion of these technologies and their relative merits is beyond the scope of this article.

The key point is that we must focus on the intended properties of the task model families. We can then replicate our procedures for the task models mapped to the other formative scales we are attempting to build (and maintain).

It is entirely possible to treat items as isomorphic (randomly exchangeable) within task model families and use a common-siblings calibration strategy where the response data are collapsed for each family (Shu et al., 2010; Sinharay & Johnson, 2003, 2013). Of course, the assumption that item operating characteristics are isomorphic within each task model family needs to be empirically verified to an adequate level of tolerable variance (e.g., Luecht, 2024; Someshwar, 2024). A further simplification would be to treat each formative trait as a separate scale for calibration purposes.
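
The collapsing step can be sketched as follows: each examinee sees one randomly chosen sibling from each family, the responses are pooled by family, and a family-level location is computed from the pooled data. The logit-of-proportion-correct used here is a crude stand-in for an actual family-level calibration (it is not the estimator in the cited work), and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items, n_families = 1000, 60, 12
item_family = np.repeat(np.arange(n_families), n_items // n_families)  # item -> family map

theta = rng.normal(size=n_persons)
b_fam = rng.normal(size=n_families)
b_item = b_fam[item_family] + rng.normal(scale=0.15, size=n_items)     # near-isomorphic siblings

# Each person answers one randomly assigned sibling per family; collapse by family.
collapsed = np.empty((n_persons, n_families))
for fam in range(n_families):
    siblings = np.where(item_family == fam)[0]
    chosen = rng.choice(siblings, size=n_persons)
    p = 1.0 / (1.0 + np.exp(-(theta - b_item[chosen])))
    collapsed[:, fam] = (rng.uniform(size=n_persons) < p).astype(float)

# Crude family "difficulty" from the collapsed data: logit of (1 - proportion correct).
p_fam = collapsed.mean(axis=0)
print(np.round(np.log((1.0 - p_fam) / p_fam), 2))   # compare with the generating b_fam
```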

Simple structure can be imposed such that individual items and families only impact a single score scale (i.e., the item discrimination vector, a_f, has only one nonzero element, corresponding to the domain to which the family belongs). There are practical calibration and scale maintenance benefits associated with isolating the item parameters for each domain using a simple-structure paradigm. That is, under that paradigm, each item family effectively has only one discrimination, corresponding to the intended dimension.
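
A minimal sketch of what simple structure means for the response function follows: a compensatory MIRT item (or family) whose discrimination vector has a single nonzero element, so the response probability depends on only one of the four domain traits. The parameter values are arbitrary illustrations, not estimates.

```python
import numpy as np

def mirt_prob(theta: np.ndarray, a: np.ndarray, d: float) -> float:
    """Compensatory MIRT response probability: P = logistic(a'theta + d)."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

# Simple structure: only one nonzero discrimination, so this family informs
# only the second of the four domain scales.
a_f = np.array([0.0, 1.2, 0.0, 0.0])
d_f = -0.4                                  # intercept (related to the family's location)
theta = np.array([1.5, -0.5, 0.8, 0.0])     # one student's four-domain profile

print(round(mirt_prob(theta, a_f, d_f), 3))  # depends only on theta[1]
```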

Estimating the hybrid model uses a single response vector for each family. The item-within-family difficulties are estimated from that same (collapsed) response vector. However, the estimation also requires a separate indexing variable for the item instances within a family. Only that subvector of responses is used to estimate each item's difficulty per the model.
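
The bookkeeping described here might look like the following: one collapsed response vector per family plus an indexing variable recording which sibling each examinee actually saw, with each sibling's difficulty informed only by its own subvector. This is an illustration of the data layout, not the authors' estimation code; all values are simulated.

```python
import numpy as np

rng = np.random.default_rng(11)
n_persons, n_siblings = 300, 3

responses = rng.integers(0, 2, size=n_persons)              # collapsed family response vector
sibling_idx = rng.integers(0, n_siblings, size=n_persons)   # indexing variable: sibling seen

# The family location would use all responses; each sibling's difficulty would use
# only the subvector of examinees who were administered that sibling.
for s in range(n_siblings):
    mask = sibling_idx == s
    print(f"sibling {s}: n = {mask.sum()}, proportion correct = {responses[mask].mean():.2f}")
```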

It should be noted that there are alternatives to the model proposed here that could be adapted to a MIRT context. These include the explanatory IRT models (EIRTMs) proposed by De Boeck and Wilson (2004) and De Boeck et al. (2011), or extensions of Fischer's linear logistic test model (Fischer, 1995) or Embretson's general multicomponent latent trait model (Embretson, 1984). It is important to emphasize that the model of choice—whatever it is—has to be able to deal with the notion of “item families” rather than individual items; see Sinharay and Johnson (2013) and Geerlings et al. (2011). Item families are created by following a design and content generation process that produces consistent, “principled” multidimensional information. It should further be noted that, if this process is properly implemented, we may not need overly complex models for operational formative testing.

We probably shouldn't rely on EIRTMs to “discover” the complexity dimensions that should go into task and item design. Good design is intentional—a tenet of Assessment Engineering and a core principle of industrial engineering. The corollary is that our psychometric models should confirm the intended scale design in as simple and straightforward a manner as possible.  EIRTMs typically incorporate person factors (demographic variables or covariates) that “explain” differential performance (person-by-item interactions). Additionally, one could apply EIRTMs to task model families instead of individual items.

MIRT modeling has largely been limited to simulation studies or empirical exploratory factor analytic studies with essentially unidimensional data and score scales. This article makes the somewhat bold assertion that the benefits of MIRT modeling can only be realized if we match its capabilities to a test purpose that requires multiple score scales, supported by appropriate test design and development activities. Formative assessment may be that ideal application.

However, useful formative assessments come with some costs, figuratively and literally speaking. For example, it is not inconceivable that hundreds of isomorphic items might be needed for each task model family within each domain to support on-demand testing. That is, formative assessments should be on-demand and provide timely profiles of multiple, instructionally sensitive score scales that respond to good instruction, curricular design, and student learning. Getting the needed multidimensional measurement information is where principled frameworks like Assessment Engineering can help in terms of design, item production, efficient calibration of item families, and overall quality assurance for the banks of items.

By mapping out each construct (scale) and the cognitive complexity of task models needed to inform progress along them, we also build strong validity arguments into the scale design itself. By creating and then calibrating task model families rather than the individual items, we solve some of the item production complications needed to sustain on-demand testing in a formative assessment context. Well-designed task model families can further bypass pilot testing and the need to calibrate individual items.

The fact that we have to contend with multiple scales and MIRT modeling is not a complication under Assessment Engineering. We merely construct each scale from the ground up and create a synchronized psychometric and test development infrastructure that emphasizes scalable production of items and test forms with intentional, ongoing quality control. Over time, through design modifications, our efforts to eliminate all but minor degrees of variation in the statistical operating characteristics within task model families help ensure highly robust and formatively useful scales.
