Still Interested in Multidimensional Item Response Theory Modeling? Here Are Some Thoughts on How to Make It Work in Practice

IF 2.7 · CAS Tier 4 (Education) · JCR Q1, Education & Educational Research
Terry A. Ackerman, Richard M. Luecht
{"title":"Still Interested in Multidimensional Item Response Theory Modeling? Here Are Some Thoughts on How to Make It Work in Practice","authors":"Terry A. Ackerman,&nbsp;Richard M. Luecht","doi":"10.1111/emip.12645","DOIUrl":null,"url":null,"abstract":"<p>Given tremendous improvements over the past three to four decades in the computational methods and computer technologies needed to estimate the parameters for higher dimensionality models (Cai, <span>2010a, 2010b</span>, <span>2017</span>), we might expect that MIRT would by now be a widely used array of models and psychometric software tools being used operationally in many educational assessment settings. Perhaps one of the few areas where MIRT has helped practitioners is in the area of understanding Differential Item Functioning (DIF) (Ackerman &amp; Ma, <span>2024</span>; Camilli, <span>1992</span>; Shealy &amp; Stout, <span>1993</span>). Nevertheless, the expectation has not been met nor do there seem to be many operational initiatives to change the <i>status quo</i>.</p><p>Some research psychometricians might lament the lack of large-scale applications of MIRT in the field of educational assessment. However, the simple fact is that MIRT has not lived up to its early expectations nor its potential due to several barriers. Following a discussion of test purpose and metric design issues in the next section, we will examine some of the barriers associated with these topics and provide suggestions for overcoming or completely avoiding them.</p><p>Tests developed for one purpose are rarely of much utility for another purpose. For example, professional certification and licensure tests designed to optimize pass-fail classifications are often not very useful for reporting scores across a large proficiency range—at least not unless the tests are extremely long. Summative, and most interim assessments used in K–12 education, are usually designed to produce reliable total-test scores. The resulting scale scores are summarized as descriptive statistical aggregations of scale scores or other functions of the scores such as classifying students in ordered achievement levels (e.g., Below Basic, Basic, Proficient, Advanced), or in modeling student growth in a subject area as part of an educational accountability system. Some commercially available online “interim” assessments provide limited progress-oriented scores and subscores from on-demand tests. However, the defensible formative utility of most interim assessments remains limited because test development and psychometric analytics follow the summative assessment test design and development paradigm: focusing on maintaining vertically aligned or equated, unidimensional scores scales (e.g., a K–12 math scale).</p><p>The requisite test design and development frameworks for summative tests focus on the relationships between the item responses and the total test score scale (e.g., maximizing item-total score correlations and the conditional reliability within prioritized regions of that score scale).</p><p>Applying MIRT models to most summative or interim assessments makes little sense. The problem is that we continue to allow policymakers to make claims about score interpretations that are not supported by the test or scale design. The standards on most K–12 tests are not multidimensional. Rather, they are a taxonomy of unordered statements—many of which cannot be measured by typical test items—that vaguely reflect the intended scope of an assessment. 
Some work is underway to provide “assessment standards” that reflect an ordered set of proficiency claims and associated evidence (measurement information) that changes in complexity. The reported scores may be viewed as composite measures representing two or more content domains or subdomains. But they still tend to function as a unidimensional scale. A unidimensional composite can be a mixture of multiple subdomains or content areas as long as the underlying, unitary trait can be empirically demonstrated to satisfy local independence under a particular IRT model.</p><p>Most MIRT studies applied to summative and interim tests are just exploratory factor analyses. That is, the models may help isolate minor amounts of nuanced multidimensionality and researchers may then attempt to interpret the patterns of residual covariance in some content-focused way. However, whenever we develop and select items with high item-total score correlations (e.g., point-biserial correlations), we build our tests to provide a singular measurement signal—an essentially unidimensional scale. We might pretend that we can legitimately organize items into content-based strands and report subscores. However, the subscore item groupings tend to be statistically unjustified and merely result in less reliable estimates of the (essentially) unidimensional trait supported by the data (Haberman &amp; Sinharay, <span>2010</span>). The point is that subscores—any subscore on an essentially unidimensional test—should NOT be computed nor reported. Developing reliable, valid, and useful subscore profiles demands a commitment to designing and maintaining <i>multiple</i> scales.</p><p>Instead, consider a different and perhaps more useful <i>formative</i> assessment purpose—at least potentially useful to teachers, parents, and importantly, the students. While some conceptualize formative assessments as low-stakes classroom assessments, the critical value of these assessments for improving instruction and changing student learning in a positive way cannot be minimized.</p><p>Well-designed formative assessments should arguably be based on <i>multiple</i> metrics that are demonstrably sensitive to good instruction, curricular design, and student learning. They further need to be offered on-demand—possibly daily—and provide immediate or at least timely, detailed, and pedagogically, actionable information to teachers. From a test design and development perspective, the implication is that formative assessments must provide useful and informative performance <i>profiles</i> for individual students that reliably identify valid student strengths on which to build and weaknesses to remediate, as well as simultaneously monitoring progress concerning multiple traits or competency-based metrics.</p><p>The central uses of formative assessments align well with the capabilities of MIRT modeling where the latter provides numerous technical psychometric tools for building and maintaining multiple score scales. [Note: this statement extends to diagnostic classification models (DCMs) where discrete, ordered traits or attributes replace the continuous proficiency metrics assumed for most MIRT models. See, for example, Sessoms and Henson (<span>2018</span>). One promising new approach for diagnostic DCM is by Stout et al. (<span>2023</span>).] The challenge is that adopting a MIRT model or DCM is not, in and of itself, a formative assessment solution. 
A different test design and development paradigm is needed.</p><p>At this juncture, it seems important to remind ourselves that we fit psychometric models to data—not the reverse. Therefore, the important issues in this article are <i>not</i> centered on which MIRT or DCM model to use or which statistical parameter estimators to employ. Those are psychometric calibration and scaling choices. The most important issues revolve around the characteristics of the data—starting with how we can efficiently design and create items, and then assemble test forms that can meet our intended formative assessment information needs. Our extensive experience with, and research-based knowledge from, large-scale summative testing may not apply to formative assessment systems. For example, we need to consider different mechanisms for evaluating item quality, calibrating the items, linking or equating scales, and scoring student performances.</p><p>If we can agree on the utility of formative assessments and the ensuing need for multiple constructs, the obvious question becomes, “<i>Which constructs should we measure</i>?” It is not sufficient to write test questions to nebulous content and/or cognitive specifications associated with content-based subdomains, factor analyze the results from large-scale field trials, and then play the “<i>Name that Factor</i>” game. Each scale needs to have a concrete purpose supported by its design properties and development priorities.</p><p>Figure 1 displays the four primary domains from the <i>Common Core State Standards (CCSS)</i> for Grade 2 Mathematics (NGA &amp; CCSSO, <span>2010</span>). Additional detail is provided at the <i>Clusters</i> and <i>Standards</i> levels for the <i>Measurement &amp; Data</i> domain.</p><p>Now contemplate this CCSS example from the perspective of a formative assessment design. At the very least, we would need four score scales, one for each domain (2.OA, 2.NBT, 2.MD, and 2.G). While likely to be positively correlated with one another, it seems implausible that most second-grade students would have the same educational opportunities to learn and almost identical levels of mathematics knowledge and skills—or even highly consistent patterns of performance across these four domain-based proficiencies. Often, dimensionality is correlated with the placement of students with learning. Before receiving instruction or long after mastering the material, data appear unidimensional. It is only when students are actively challenged by learning that dimensions appear. A well-designed formative assessment system might <i>expect</i> to observe different score patterns to emerge across the four domains reflecting different patterns of student strengths and weaknesses concerning the domain-specific knowledge and skills measured.</p><p>Consider Figure 2. The left side of the figure depicts the intended structure. That is, the four ellipses are the constructs with the curved connectors denoting nonzero covariances among the scales. The middle image shows the potential magnitude of the six covariances between the traits—which would be proportional to the cosines of the angles between each domain-based scale. From a measurement perspective, each trait is a factor or reference composite (Luecht &amp; Miller, <span>1992</span>; Wang, <span>1986</span>) that psychometrically functions as a unique scale. Finally, the right side shows the score profiles for three students. 
This figure outlines a high-level target formative assessment scale design!</p><p>Figure 3 explicitly shows more detail about how our test and scale design goals are substantially different for unidimensional (summative or interim) and formative assessments. Under a unidimensional design paradigm (left side of Figure 3) all items within each of the included content domains are expected to be highly correlated with one another (+++) and with all items in other domains. Conversely, for a formative test designed to measure four rather distinct domains (right side of Figure 2). The items would correlate well within a domain but not as highly across domains.</p><p>These correlational patterns are intentional design goals. They are not accidental nor are they likely to be “discovered” via creative factor analyses. We need to build our items and the scales synchronously so that statistical structural modeling confirms the intended scale structures and properties of the test items. If we subsequently find that our target structures are not being met, we will likely need to impose serious item design constraints on the cognitive complexity of the items within each domain to deemphasize and isolate the intended signals for our constructs.</p><p>The nature of the (intended) multidimensionality underlying may also change over time as shown in Figure 4. This figure provides an example of a high-level scale-design requirement that might support a domain-specific formative assessment system to be used during a single semester or entire academic year. Some foundational knowledge and skill domains may become more prominent early in the semester and disappear or be absorbed later on. See Luecht (<span>2003</span>) for a substantive example involving the changing dimensionality of oral language proficiency. Figure 4 shows four <i>epochs</i> (e.g., time periods that could correspond to different learning modules or the scope and sequence of learning objectives). There are also five measured traits or constructs (<i>θ</i><sub>1</sub>, <i>θ</i><sub>2</sub>,…,<i>θ</i><sub>5</sub>). The structural diagrams near the top of each rectangle denote the constructs and their inter-relationships (i.e., covariances). The double-ended arrows denote the nature of those covariances where smaller angles correspond with higher covariances or correlations, proportional to the cosines of the angles.</p><p>Figure 4 suggests that we need two scales for Epoch #1 <i>θ</i><sub>1</sub> and <i>θ</i><sub>2</sub>. We transition to three scales to profile student strengths and weaknesses at Epoch #2, where <i>θ</i><sub>1</sub> and <i>θ</i><sub>2</sub> become more highly correlated and a third scale, <i>θ</i><sub>3</sub>, is needed. (Note that the rationale for the increased magnitude of the correlation between <i>θ</i><sub>1</sub> and <i>θ</i><sub>2</sub> as we move from Epoch #1 to #2 is, as students learn and simultaneously master two or more traits, we would expect the inter-trait correlations to increase toward a unidimensional composite.) At Epoch #3, <i>θ</i><sub>1</sub> drops out of the intended multidimensional score profile and <i>θ</i><sub>4</sub> emerges—with <i>θ</i><sub>2</sub> and <i>θ</i><sub>3</sub> now becoming more correlated with one another. 
Finally, by Epoch #4, <i>θ</i><sub>1</sub>, <i>θ</i><sub>2</sub>, and <i>θ</i><sub>3</sub> coalesce, and a fifth trait, <i>θ</i><sub>5</sub>, is added to the measurement profile.</p><p>The type of dynamic score-scale profile design implied by Figure 4 is unlikely to happen by starting with an item bank constructed to primarily support a unidimensional score scale (e.g., a summative test item bank or a simply a bank where items are screened have the highest possible item-total test score correlations). Building and maintaining multiple useful scales to support robust score profiles starts with an in-depth understanding of the nature of each construct at a very detailed level. We are essentially building multiple test batteries (e.g., a battery for each epoch shown in Figure 4). This is not a trivial undertaking but, as discussed further on, is feasible using a <i>principled</i> scale and item design framework.</p><p>The test design and development challenges for building and maintaining multidimensional score scales to support formative assessment are not trivial. There can be substantial up-front costs to engineer a robust infrastructure that can: (a) generate potentially massive item banks: (b) support the intended multidimensional score scale structure(s); (c) significantly reduce or eliminate pilot-testing and calibration of individual items; and (d) support optimal within-domain test assembly constraints and forms construction for on-demand testing, possible with multiple epochs (Figure 4). The longer-term payoffs emerge from integrating multidimensional modeling efforts with a robust architecture for generating isomorphic items to detailed task model specifications.</p><p>Following the Assessment Engineering framework, Luecht and Burke (<span>2020</span>) proposed a test design and development paradigm that concentrates scale design on manipulable properties of the tasks (also see Luecht et al., <span>2010</span>; Luecht, 2012a, 2012b, <span>2013, 2016</span>). Assessment Engineering advocates for laying out in a detailed fashion the ordered proficiency claims for each scale. For example, foundational claims about knowledge and skills are superseded by incrementally more rigorous claims. The core Assessment Engineering item and test development technology is called <i>task modeling</i>.</p><p>Task modeling focuses on engineering two critical aspects of instrument design. First, it incorporates item-difficulty modeling research to establish empirically verified complexity design layers that control the cognitive complexity and difficulty of entire families of items called <i>task-model families</i>. Note that discrimination is NOT something that we care about directly with item difficulty modeling. By controlling cognitive complexity via the complexity design layers, we can statistically <i>locate</i> each task-model family on the underlying scale and further use them as quality control indicators for new items as they are generated for each family. Items within each task-model family are treated as statistically and substantively <i>isomorphic</i> (i.e., exchangeable use for scoring and score interpretation purposes as evidence of specific proficiency claims). As discussed further on, this isomorphic property—subject to empirical verification—has enormous benefits for using “data-hungry” MIRT models as part of an operational formative assessment system. The second aspect involves developing a task model map for each scale. 
Task model maps replace more traditional content blueprints with a highly detail test assembly specification that simultaneously represents the test-form content and the statistical measurement information target(s).</p><p>Figure 5 depicts the alignment between items, task models, and complexity design layers that control the difficulties (locations) of the task model families. Task complexity is increased (left to right) along each class of complexity design layer metrics. Each task model family is represented by one or more item models that can generate multiple, isomorphic instantiations. The task model map represents the comprehensive content, cognitive, and statistical blueprint for building parallel test forms on a single scale.</p><p>So how does Assessment Engineering task modeling address the challenges of formative assessment and MIRT modeling? There is a three-part answer to that question. First, we need to recognize that each scale is an independent progression of incrementally more complex task challenges. We need to map the proficiency claims for each scale as a unique construct independent of all other constructs. Assessment Engineering construct mapping details the progression of <i>domain-specific</i> proficiency-based claims and expected evidence to support those claims. Second, we develop task model maps that provide the collection of incrementally more complex task challenges along each domain-specific scale. Different scales are likely to represent different types of complexity design layers by design. Third, we do not complicate test design by adopting complicated MIRT models that we expect to absorb nuanced misfit within or across task model families.</p><p>There are three ways to generate the item families for each task model family: (1) using human item writers; (2) building or licensing automatic item generation software and customizing it for each domain of interest; or (3) employing large language models and generative artificial intelligence to produce the items.</p><p>Using human item writers can prove to be cost-effective when relatively small numbers of items per task model family are needed (e.g., fewer than 50 items per family). The item writers should use tightly controlled templates that limit their creativity in generating the variant items. In addition, active quality control procedures can be implemented that incorporate natural language processing tools and CDL-based feedback to help the item writers refine each family.</p><p>Automated item generation is implemented using a dedicated software application. High-quality parent items within each domain are <i>parameterized</i> by having subject-matter experts note particular segments (words, phrases, variables, or values) that can take on multiple values, subject to plausibility constraints (Embretson, <span>1999</span>; Embretson &amp; Kingston, <span>2018</span>; Gierl &amp; Haladyna, <span>2012</span>; Gierl &amp; Lai, <span>2012</span>; Gierl et al., <span>2012</span>). Natural language processing syntax and plausibility checking can also be implemented to screen out some of the generated items. AIG can produce hundreds or thousands of items from a given parent item.</p><p>Finally, the use of large language models and generative artificial intelligence is still relatively new for operational test development. 
There has been some success in language testing using large language models based on transformer architectures like OpenAI's <i>ChatGPT</i>™ and <i>Google's</i> BERT; however, most of these applications still rely on human review and refinement of the generated test content. A more in-depth discussion of these technologies and their relative merits is beyond the scope of this article.</p><p>The key point is that we must focus on the intended properties of the task model families. We can then replicate our procedures for the task models mapped to the other formative scales we are attempting to build (and maintain).</p><p>It is entirely possible to treat items as isomorphic (randomly exchangeable) within task model families and use a common-siblings calibration strategy where the response data are collapsed for each family (Shu et al., <span>2010</span>; Sinharay &amp; Johnson, 2003, <span>2013</span>). Of course, the assumption of isomorphism of the within-task model families item operating characteristics needs to be empirically verified to an adequate level of <i>tolerable</i> variance (e.g., Luecht, <span>2024</span>; Someshwar, <span>2024</span>). A further simplification would be to treat each formative trait as a separate scale for calibration purposes.</p><p>Simple structure can be imposed such that individual items and families only impact a single score scale (i.e., the item discrimination vector, <b><i>a</i></b><i><sub>f</sub></i>, has only one nonzero element corresponding to the domain two to which it belongs). There are practical calibration and scale maintenance benefits associated with isolating the item parameters for each domain using a simple structure paradigm. That is, under that paradigm, each item family effectively only has one discrimination corresponding to the intended dimension.</p><p>Estimating the hybrid model uses a single response vector for each family. The item-within-family difficulties are estimated from that same (collapsed) response vector. However, the estimation also requires a separate indexing variable for the item instances within a family. Only that subvector of responses is used to estimate each item's difficulty per the model.</p><p>It should be noted that there are alternatives to the proposed model in the paper that could be adapted to a MIRT context. These include the explanatory IRT models (EIRTM) proposed by De Boeck and Wilson (<span>2004</span>) and De Boeck et al. (<span>2011</span>), or extensions of Fischer's linear logistic test model (Fischer, <span>1995</span>) or Embretson's general multicomponent latent trait model (Embretson, <span>1984</span>).  It is important to emphasize that the model of choice—whatever it is—has to be able to deal with the notion of “item families,” rather than individual items—see Sinharay and Johnson (<span>2013</span>) and Geerlings et al. (<span>2011</span>). Item families are created by following a design and content generation process that produces consistent “principled” multidimensional information. It should further be noted that if properly implemented, we may not need overly complex models for operational formative testing.</p><p>We probably shouldn't rely on EIRTMs to “discover” the complexity dimensions that should go into task and item design. Good design is intentional—a tenet of Assessment Engineering and a core principle of industrial engineering. The corollary is that our psychometric models should confirm the intended scale design in as simple and straightforward a manner as possible.  
EIRTMs typically incorporate person factors (demographic variables or covariates) that “explain” differential performance (person-by-item interactions). Additionally, one could apply EIRTMs to task model families instead of individual items.</p><p>MIRT modeling has largely been limited to simulation studies or empirical exploratory factor analytic studies with essentially unidimensional data and score scales. This article makes the somewhat bold assertation that the benefits of MIRT modeling can only be realized if we match its capabilities to a test purpose with supporting test design and development activities that require multiple score scales. Formative assessment may be that ideal application.</p><p>However, useful formative assessments come with some costs, figuratively and literally speaking. For example, it is not inconceivable that hundreds of isomorphic items might be needed for each task model family within each domain to support on-demand testing. That is, formative assessments should be on-demand and provide timely profiles of multiple, instructionally sensitive score scales that respond to good instruction, curricular design, and student learning. Getting the needed multidimensional measurement information is where principled frameworks like Assessment Engineering can help in terms of design, item production, efficient calibration of item families, and overall quality assurance for the banks of items.</p><p>By mapping out each construct (scale) and the cognitive complexity of task models needed to inform progress along them, we also build strong validity arguments into the scale design itself. By creating and then calibrating task model families rather than the individual items, we solve some of the item production complications needed to sustain on-demand testing in a formative assessment context. Well-designed task model families can further bypass pilot testing and the need to calibrate individual items.</p><p>The fact that we have to contend with multiple scales and MIRT modeling is not a complication under Assessment Engineering. We merely construct each scale from the ground up and create a synchronized psychometric and test development infrastructure that emphasizes scalable production of items and test forms with intentional ongoing quality control. Over time, through design modifications, our efforts to eliminate all but minor degrees of variation in the statistical operating characteristics within task model families imply and help ensure highly robust and formatively useful scales.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"93-100"},"PeriodicalIF":2.7000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12645","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12645","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
引用次数: 0

Abstract

Given tremendous improvements over the past three to four decades in the computational methods and computer technologies needed to estimate the parameters of higher-dimensionality models (Cai, 2010a, 2010b, 2017), we might expect multidimensional item response theory (MIRT) to be, by now, a widely used array of models and psychometric software tools operating in many educational assessment settings. Perhaps one of the few areas where MIRT has helped practitioners is in understanding differential item functioning (DIF) (Ackerman & Ma, 2024; Camilli, 1992; Shealy & Stout, 1993). Nevertheless, that expectation has not been met, nor do there seem to be many operational initiatives to change the status quo.

Some research psychometricians might lament the lack of large-scale applications of MIRT in the field of educational assessment. However, the simple fact is that MIRT has not lived up to its early expectations nor its potential due to several barriers. Following a discussion of test purpose and metric design issues in the next section, we will examine some of the barriers associated with these topics and provide suggestions for overcoming or completely avoiding them.

Tests developed for one purpose are rarely of much utility for another. For example, professional certification and licensure tests designed to optimize pass-fail classifications are often not very useful for reporting scores across a wide proficiency range—at least not unless the tests are extremely long. Summative assessments, and most interim assessments used in K–12 education, are usually designed to produce reliable total-test scores. The resulting scale scores are summarized as descriptive statistical aggregations or as other functions of the scores, such as classifying students into ordered achievement levels (e.g., Below Basic, Basic, Proficient, Advanced) or modeling student growth in a subject area as part of an educational accountability system. Some commercially available online “interim” assessments provide limited progress-oriented scores and subscores from on-demand tests. However, the defensible formative utility of most interim assessments remains limited because test development and psychometric analytics follow the summative test design and development paradigm: maintaining vertically aligned or equated, unidimensional score scales (e.g., a K–12 math scale).

The requisite test design and development frameworks for summative tests focus on the relationships between the item responses and the total test score scale (e.g., maximizing item-total score correlations and the conditional reliability within prioritized regions of that score scale).

Applying MIRT models to most summative or interim assessments makes little sense. The problem is that we continue to allow policymakers to make claims about score interpretations that are not supported by the test or scale design. The standards on most K–12 tests are not multidimensional. Rather, they are a taxonomy of unordered statements—many of which cannot be measured by typical test items—that vaguely reflect the intended scope of an assessment. Some work is underway to provide “assessment standards” that reflect an ordered set of proficiency claims and associated evidence (measurement information) that changes in complexity. The reported scores may be viewed as composite measures representing two or more content domains or subdomains. But they still tend to function as a unidimensional scale. A unidimensional composite can be a mixture of multiple subdomains or content areas as long as the underlying, unitary trait can be empirically demonstrated to satisfy local independence under a particular IRT model.

Most MIRT studies applied to summative and interim tests are just exploratory factor analyses. That is, the models may help isolate minor amounts of nuanced multidimensionality, and researchers may then attempt to interpret the patterns of residual covariance in some content-focused way. However, whenever we develop and select items with high item-total score correlations (e.g., point-biserial correlations), we build our tests to provide a singular measurement signal—an essentially unidimensional scale. We might pretend that we can legitimately organize items into content-based strands and report subscores. However, the subscore item groupings tend to be statistically unjustified and merely result in less reliable estimates of the (essentially) unidimensional trait supported by the data (Haberman & Sinharay, 2010). The point is that subscores—any subscore on an essentially unidimensional test—should NOT be computed or reported. Developing reliable, valid, and useful subscore profiles demands a commitment to designing and maintaining multiple scales.
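To make the Haberman and Sinharay (2010) point concrete, here is a minimal simulation sketch (not an analysis from the article): it generates essentially unidimensional responses, splits the items into two arbitrary “content” subscores, and shows that their disattenuated correlation approaches 1.0, meaning neither subscore adds information beyond the total score. The item counts and parameter ranges are illustrative assumptions.

```python
# Minimal sketch: subscores add no value on essentially unidimensional data.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 2000, 40
theta = rng.normal(size=n_persons)                    # one underlying trait
a = rng.uniform(0.8, 1.6, n_items)                    # 2PL discriminations
b = rng.normal(0.0, 1.0, n_items)                     # 2PL difficulties
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # response probabilities
x = (rng.uniform(size=p.shape) < p).astype(int)       # scored 0/1 responses

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

# Pretend the first and second halves are two content-based "subscores."
sub1, sub2 = x[:, :20], x[:, 20:]
r_obs = np.corrcoef(sub1.sum(axis=1), sub2.sum(axis=1))[0, 1]
r_true = r_obs / np.sqrt(cronbach_alpha(sub1) * cronbach_alpha(sub2))
print(f"disattenuated subscore correlation = {r_true:.2f}")   # close to 1.0
```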

Instead, consider a different and perhaps more useful formative assessment purpose—at least potentially useful to teachers, parents, and importantly, the students. While some conceptualize formative assessments as low-stakes classroom assessments, the critical value of these assessments for improving instruction and changing student learning in a positive way cannot be minimized.

Well-designed formative assessments should arguably be based on multiple metrics that are demonstrably sensitive to good instruction, curricular design, and student learning. They further need to be offered on demand—possibly daily—and provide immediate, or at least timely, detailed, and pedagogically actionable information to teachers. From a test design and development perspective, the implication is that formative assessments must provide useful and informative performance profiles for individual students that reliably identify valid student strengths on which to build and weaknesses to remediate, while simultaneously monitoring progress on multiple traits or competency-based metrics.

The central uses of formative assessments align well with the capabilities of MIRT modeling, where the latter provides numerous technical psychometric tools for building and maintaining multiple score scales. [Note: this statement extends to diagnostic classification models (DCMs), where discrete, ordered traits or attributes replace the continuous proficiency metrics assumed for most MIRT models. See, for example, Sessoms and Henson (2018). One promising new DCM approach is that of Stout et al. (2023).] The challenge is that adopting a MIRT model or DCM is not, in and of itself, a formative assessment solution. A different test design and development paradigm is needed.

At this juncture, it seems important to remind ourselves that we fit psychometric models to data—not the reverse. Therefore, the important issues in this article are not centered on which MIRT or DCM model to use or which statistical parameter estimators to employ. Those are psychometric calibration and scaling choices. The most important issues revolve around the characteristics of the data—starting with how we can efficiently design and create items, and then assemble test forms that can meet our intended formative assessment information needs. Our extensive experience with, and research-based knowledge from, large-scale summative testing may not apply to formative assessment systems. For example, we need to consider different mechanisms for evaluating item quality, calibrating the items, linking or equating scales, and scoring student performances.

If we can agree on the utility of formative assessments and the ensuing need for multiple constructs, the obvious question becomes, “Which constructs should we measure?” It is not sufficient to write test questions to nebulous content and/or cognitive specifications associated with content-based subdomains, factor analyze the results from large-scale field trials, and then play the “Name that Factor” game. Each scale needs to have a concrete purpose supported by its design properties and development priorities.

Figure 1 displays the four primary domains from the Common Core State Standards (CCSS) for Grade 2 Mathematics (NGA & CCSSO, 2010). Additional detail is provided at the Clusters and Standards levels for the Measurement & Data domain.

Now contemplate this CCSS example from the perspective of a formative assessment design. At the very least, we would need four score scales, one for each domain (2.OA, 2.NBT, 2.MD, and 2.G). While these proficiencies are likely to be positively correlated with one another, it seems implausible that most second-grade students would have the same educational opportunities to learn and almost identical levels of mathematics knowledge and skills—or even highly consistent patterns of performance across these four domain-based proficiencies. Often, the dimensionality evident in the data is related to where students are in their learning: before instruction, or long after mastering the material, the data appear unidimensional; it is only when students are actively challenged by learning that distinct dimensions appear. A well-designed formative assessment system might therefore expect different score patterns to emerge across the four domains, reflecting different patterns of student strengths and weaknesses in the domain-specific knowledge and skills measured.

Consider Figure 2. The left side of the figure depicts the intended structure. That is, the four ellipses are the constructs, with the curved connectors denoting nonzero covariances among the scales. The middle image shows the potential magnitude of the six covariances between the traits—which would be proportional to the cosines of the angles between the domain-based scales. From a measurement perspective, each trait is a factor or reference composite (Luecht & Miller, 1992; Wang, 1986) that psychometrically functions as a unique scale. Finally, the right side shows the score profiles for three students. This figure outlines a high-level target formative assessment scale design!
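To make the geometry referenced here explicit, the short sketch below treats hypothetical correlations among the four CCSS domain scales as cosines of the angles between them, mirroring the middle panel of Figure 2. The correlation values are invented for illustration and do not come from the article.

```python
# Angles between domain scales implied by (hypothetical) inter-domain correlations.
import numpy as np

domains = ["2.OA", "2.NBT", "2.MD", "2.G"]
R = np.array([[1.00, 0.70, 0.55, 0.40],
              [0.70, 1.00, 0.60, 0.45],
              [0.55, 0.60, 1.00, 0.50],
              [0.40, 0.45, 0.50, 1.00]])   # illustrative correlations

for i in range(len(domains)):
    for j in range(i + 1, len(domains)):
        angle = np.degrees(np.arccos(R[i, j]))       # correlation = cos(angle)
        print(f"{domains[i]} vs {domains[j]}: r = {R[i, j]:.2f}, angle = {angle:.0f} degrees")
```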

Figure 3 shows in more detail how our test and scale design goals differ substantially between unidimensional (summative or interim) and formative assessments. Under a unidimensional design paradigm (left side of Figure 3), all items within each of the included content domains are expected to be highly correlated with one another (+++) and with all items in other domains. Conversely, for a formative test designed to measure four rather distinct domains (right side of Figure 3), the items would correlate well within a domain but not as highly across domains.

These correlational patterns are intentional design goals. They are not accidental nor are they likely to be “discovered” via creative factor analyses. We need to build our items and the scales synchronously so that statistical structural modeling confirms the intended scale structures and properties of the test items. If we subsequently find that our target structures are not being met, we will likely need to impose serious item design constraints on the cognitive complexity of the items within each domain to deemphasize and isolate the intended signals for our constructs.
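One simple way to check whether the intended structure is holding is to compare average inter-item correlations within and across domains, as in the sketch below. The scored response matrix and the item-to-domain assignment are hypothetical placeholders, and this is only a descriptive screen, not a substitute for confirmatory structural modeling.

```python
# Descriptive check of the intended within- vs. cross-domain correlation pattern.
import numpy as np

def structure_check(responses: np.ndarray, item_domain: list) -> None:
    """responses: n_persons x n_items scored matrix; item_domain: domain label per item."""
    R = np.corrcoef(responses, rowvar=False)
    domains = sorted(set(item_domain))
    idx = {d: [i for i, dom in enumerate(item_domain) if dom == d] for d in domains}
    for d1 in domains:
        for d2 in domains:
            block = R[np.ix_(idx[d1], idx[d2])]
            if d1 == d2:
                vals = block[~np.eye(len(idx[d1]), dtype=bool)]   # drop the diagonal
            else:
                vals = block.ravel()
            print(f"{d1} x {d2}: mean inter-item r = {vals.mean():.2f}")
```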

The nature of the (intended) multidimensionality underlying the assessment may also change over time, as shown in Figure 4. This figure provides an example of a high-level scale-design requirement that might support a domain-specific formative assessment system used during a single semester or an entire academic year. Some foundational knowledge and skill domains may become more prominent early in the semester and disappear or be absorbed later on. See Luecht (2003) for a substantive example involving the changing dimensionality of oral language proficiency. Figure 4 shows four epochs (e.g., time periods that could correspond to different learning modules or the scope and sequence of learning objectives). There are also five measured traits or constructs (θ1, θ2,…,θ5). The structural diagrams near the top of each rectangle denote the constructs and their interrelationships (i.e., covariances). The double-ended arrows denote the nature of those covariances, where smaller angles correspond to higher covariances or correlations, proportional to the cosines of the angles.

Figure 4 suggests that we need two scales for Epoch #1: θ1 and θ2. We transition to three scales to profile student strengths and weaknesses at Epoch #2, where θ1 and θ2 become more highly correlated and a third scale, θ3, is needed. (The rationale for the increased magnitude of the correlation between θ1 and θ2 as we move from Epoch #1 to #2 is that, as students learn and simultaneously master two or more traits, we would expect the inter-trait correlations to increase toward a unidimensional composite.) At Epoch #3, θ1 drops out of the intended multidimensional score profile and θ4 emerges—with θ2 and θ3 now becoming more correlated with one another. Finally, by Epoch #4, θ1, θ2, and θ3 coalesce, and a fifth trait, θ5, is added to the measurement profile.

The type of dynamic score-scale profile design implied by Figure 4 is unlikely to happen by starting with an item bank constructed primarily to support a unidimensional score scale (e.g., a summative test item bank, or simply a bank where items are screened to have the highest possible item-total test score correlations). Building and maintaining multiple useful scales to support robust score profiles starts with an in-depth understanding of the nature of each construct at a very detailed level. We are essentially building multiple test batteries (e.g., a battery for each epoch shown in Figure 4). This is not a trivial undertaking but, as discussed further on, it is feasible using a principled scale and item design framework.

The test design and development challenges for building and maintaining multidimensional score scales to support formative assessment are not trivial. There can be substantial up-front costs to engineer a robust infrastructure that can: (a) generate potentially massive item banks; (b) support the intended multidimensional score scale structure(s); (c) significantly reduce or eliminate pilot testing and calibration of individual items; and (d) support optimal within-domain test assembly constraints and forms construction for on-demand testing, possibly with multiple epochs (Figure 4). The longer-term payoffs emerge from integrating multidimensional modeling efforts with a robust architecture for generating isomorphic items to detailed task model specifications.

Following the Assessment Engineering framework, Luecht and Burke (2020) proposed a test design and development paradigm that concentrates scale design on manipulable properties of the tasks (also see Luecht et al., 2010; Luecht, 2012a, 2012b, 2013, 2016). Assessment Engineering advocates for laying out in a detailed fashion the ordered proficiency claims for each scale. For example, foundational claims about knowledge and skills are superseded by incrementally more rigorous claims. The core Assessment Engineering item and test development technology is called task modeling.

Task modeling focuses on engineering two critical aspects of instrument design. First, it incorporates item-difficulty modeling research to establish empirically verified complexity design layers that control the cognitive complexity and difficulty of entire families of items called task-model families. Note that discrimination is NOT something we care about directly in item difficulty modeling. By controlling cognitive complexity via the complexity design layers, we can statistically locate each task-model family on the underlying scale and further use the families as quality control indicators for new items as they are generated for each family. Items within each task-model family are treated as statistically and substantively isomorphic (i.e., exchangeable for scoring and score-interpretation purposes as evidence of specific proficiency claims). As discussed further on, this isomorphic property—subject to empirical verification—has enormous benefits for using “data-hungry” MIRT models as part of an operational formative assessment system. The second aspect involves developing a task model map for each scale. Task model maps replace more traditional content blueprints with a highly detailed test assembly specification that simultaneously represents the test-form content and the statistical measurement information target(s).

Figure 5 depicts the alignment between items, task models, and complexity design layers that control the difficulties (locations) of the task model families. Task complexity is increased (left to right) along each class of complexity design layer metrics. Each task model family is represented by one or more item models that can generate multiple, isomorphic instantiations. The task model map represents the comprehensive content, cognitive, and statistical blueprint for building parallel test forms on a single scale.
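As a rough illustration of what one row of a task model map might carry, the sketch below records a family's scale, proficiency claim, complexity-design-layer settings, target scale location, and planned number of isomorphic items. The fields and values are hypothetical and are not a published Assessment Engineering specification.

```python
# Hypothetical representation of task-model-map entries.
from dataclasses import dataclass

@dataclass
class TaskModelFamily:
    scale: str              # formative scale the family belongs to
    claim: str              # proficiency claim the family evidences
    cdl_settings: dict      # complexity design layer levels (hypothetical names)
    target_location: float  # intended location of the family on its scale
    n_items_planned: int = 100   # isomorphic instances to generate

task_model_map = [
    TaskModelFamily("2.MD", "Measures lengths using standard units",
                    {"steps": 1, "representation": "concrete"}, -0.8),
    TaskModelFamily("2.MD", "Solves two-step word problems involving lengths",
                    {"steps": 2, "representation": "symbolic"}, 0.6),
]
```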

So how does Assessment Engineering task modeling address the challenges of formative assessment and MIRT modeling? There is a three-part answer to that question. First, we need to recognize that each scale is an independent progression of incrementally more complex task challenges. We need to map the proficiency claims for each scale as a unique construct independent of all other constructs. Assessment Engineering construct mapping details the progression of domain-specific proficiency-based claims and expected evidence to support those claims. Second, we develop task model maps that provide the collection of incrementally more complex task challenges along each domain-specific scale. Different scales are likely to represent different types of complexity design layers by design. Third, we do not complicate test design by adopting complicated MIRT models that we expect to absorb nuanced misfit within or across task model families.

There are three ways to generate the item families for each task model family: (1) using human item writers; (2) building or licensing automatic item generation software and customizing it for each domain of interest; or (3) employing large language models and generative artificial intelligence to produce the items.

Using human item writers can prove to be cost-effective when relatively small numbers of items per task-model family are needed (e.g., fewer than 50 items per family). The item writers should use tightly controlled templates that limit their creativity in generating the variant items. In addition, active quality control procedures can be implemented that incorporate natural language processing tools and feedback based on the complexity design layers (CDLs) to help the item writers refine each family.

Automated item generation (AIG) is implemented using a dedicated software application. High-quality parent items within each domain are parameterized by having subject-matter experts mark particular segments (words, phrases, variables, or values) that can take on multiple values, subject to plausibility constraints (Embretson, 1999; Embretson & Kingston, 2018; Gierl & Haladyna, 2012; Gierl & Lai, 2012; Gierl et al., 2012). Natural language processing syntax and plausibility checking can also be implemented to screen out some of the generated items. AIG can produce hundreds or thousands of items from a given parent item.
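A toy sketch of the parameterization idea follows: a parent item whose marked segments (slots) take multiple values, filtered by a simple plausibility constraint. The template, slot values, and constraint are invented for illustration; operational AIG systems are considerably more sophisticated.

```python
# Toy template-based item generation with a plausibility constraint.
from itertools import product

template = ("A ribbon is {length} {unit} long. It is cut into {pieces} equal pieces. "
            "How long is each piece?")
slots = {
    "length": [12, 18, 24, 30],
    "unit": ["centimeters", "inches"],
    "pieces": [2, 3, 4, 6],
}

def plausible(values: dict) -> bool:
    # keep only variants with whole-number answers (a toy constraint)
    return values["length"] % values["pieces"] == 0

variants = [template.format(**dict(zip(slots, combo)))
            for combo in product(*slots.values())
            if plausible(dict(zip(slots, combo)))]
print(f"{len(variants)} plausible variants generated from one parent item")
```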

Finally, the use of large language models and generative artificial intelligence is still relatively new for operational test development. There has been some success in language testing using large language models based on transformer architectures like OpenAI's ChatGPT™ and Google's BERT; however, most of these applications still rely on human review and refinement of the generated test content. A more in-depth discussion of these technologies and their relative merits is beyond the scope of this article.

The key point is that we must focus on the intended properties of the task model families. We can then replicate our procedures for the task models mapped to the other formative scales we are attempting to build (and maintain).

It is entirely possible to treat items as isomorphic (randomly exchangeable) within task-model families and use a common-siblings calibration strategy in which the response data are collapsed for each family (Shu et al., 2010; Sinharay & Johnson, 2003, 2013). Of course, the assumption that item operating characteristics are isomorphic within task-model families needs to be empirically verified to an adequate level of tolerable variance (e.g., Luecht, 2024; Someshwar, 2024). A further simplification would be to treat each formative trait as a separate scale for calibration purposes.
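The sketch below illustrates, purely descriptively, the two ideas in this paragraph: pooling sibling responses to the family level and flagging families whose sibling difficulties vary beyond a tolerance. The column names, logit-of-proportion-correct summaries, and tolerance value are assumptions for illustration, not the cited calibration procedures.

```python
# Descriptive family-level summary and a crude isomorphism tolerance check.
import numpy as np
import pandas as pd

def family_summary(df: pd.DataFrame, tol_logits: float = 0.3) -> pd.DataFrame:
    """df columns (hypothetical): person, family, item, score (0/1)."""
    item_p = df.groupby(["family", "item"])["score"].mean().clip(0.01, 0.99)
    item_logit = -np.log(item_p / (1 - item_p))       # classical sibling difficulty
    out = item_logit.groupby(level="family").agg(
        family_location="mean",    # collapsed, family-level location
        sibling_spread="std",      # variation among siblings within the family
    )
    out["within_tolerance"] = out["sibling_spread"] <= tol_logits
    return out
```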

Simple structure can be imposed such that individual items and families only affect a single score scale (i.e., the item discrimination vector, a_f, has only one nonzero element, corresponding to the domain to which the item belongs). There are practical calibration and scale maintenance benefits associated with isolating the item parameters for each domain using a simple-structure paradigm. That is, under that paradigm, each item family effectively has only one discrimination, corresponding to the intended dimension.
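The simple-structure constraint can be written directly as a loading matrix with a single nonzero discrimination per item, as in the compensatory multidimensional 2PL sketch below; the numeric values are illustrative.

```python
# Compensatory M2PL with simple structure: one nonzero discrimination per item.
import numpy as np

def mirt_prob(theta: np.ndarray, A: np.ndarray, d: np.ndarray) -> np.ndarray:
    """P(X = 1 | theta) = logistic(A @ theta + d), one probability per item."""
    return 1.0 / (1.0 + np.exp(-(A @ theta + d)))

A = np.array([[1.2, 0.0],       # items 1-2 load only on domain 1
              [0.9, 0.0],
              [0.0, 1.4]])      # item 3 loads only on domain 2
d = np.array([0.2, -0.5, 0.1])  # intercepts
theta = np.array([0.5, -1.0])   # one examinee's two-domain profile
print(mirt_prob(theta, A, d))   # item 3's probability depends only on theta_2
```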

Estimating the hybrid model uses a single response vector for each family. The item-within-family difficulties are estimated from that same (collapsed) response vector. However, the estimation also requires a separate indexing variable for the item instances within a family. Only that subvector of responses is used to estimate each item's difficulty per the model.
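Read loosely, the bookkeeping described here can be sketched as follows: the pooled family response vector informs a family-level location, while each item's indexed subvector informs its own difficulty, expressed as a deviation from that location. This is a descriptive stand-in using logits of proportion correct, under assumed column names; it is not the authors' model or estimator.

```python
# Descriptive sketch of family-level location plus item-within-family deviations.
import numpy as np
import pandas as pd

def hybrid_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """df columns (hypothetical): person, family, item, score (0/1)."""
    def neg_logit(p: pd.Series) -> pd.Series:
        p = p.clip(0.01, 0.99)
        return -np.log(p / (1 - p))                   # higher = harder

    fam_loc = neg_logit(df.groupby("family")["score"].mean())
    item_loc = neg_logit(df.groupby(["family", "item"])["score"].mean())
    out = item_loc.rename("item_difficulty").reset_index()
    out["family_location"] = out["family"].map(fam_loc)
    out["item_deviation"] = out["item_difficulty"] - out["family_location"]
    return out
```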

It should be noted that there are alternatives to the model proposed in the paper that could be adapted to a MIRT context. These include the explanatory IRT models (EIRTMs) proposed by De Boeck and Wilson (2004) and De Boeck et al. (2011), as well as extensions of Fischer's linear logistic test model (Fischer, 1995) or Embretson's general multicomponent latent trait model (Embretson, 1984). It is important to emphasize that the model of choice—whatever it is—has to be able to deal with the notion of “item families” rather than individual items; see Sinharay and Johnson (2013) and Geerlings et al. (2011). Item families are created by following a design and content generation process that produces consistent, “principled” multidimensional information. It should further be noted that, if the process is properly implemented, we may not need overly complex models for operational formative testing.

We probably shouldn't rely on EIRTMs to “discover” the complexity dimensions that should go into task and item design. Good design is intentional—a tenet of Assessment Engineering and a core principle of industrial engineering. The corollary is that our psychometric models should confirm the intended scale design in as simple and straightforward a manner as possible.  EIRTMs typically incorporate person factors (demographic variables or covariates) that “explain” differential performance (person-by-item interactions). Additionally, one could apply EIRTMs to task model families instead of individual items.

MIRT modeling has largely been limited to simulation studies or empirical exploratory factor analytic studies with essentially unidimensional data and score scales. This article makes the somewhat bold assertion that the benefits of MIRT modeling can only be realized if we match its capabilities to a test purpose with supporting test design and development activities that require multiple score scales. Formative assessment may be that ideal application.

However, useful formative assessments come with some costs, figuratively and literally speaking. For example, it is not inconceivable that hundreds of isomorphic items might be needed for each task model family within each domain to support on-demand testing. That is, formative assessments should be on-demand and provide timely profiles of multiple, instructionally sensitive score scales that respond to good instruction, curricular design, and student learning. Getting the needed multidimensional measurement information is where principled frameworks like Assessment Engineering can help in terms of design, item production, efficient calibration of item families, and overall quality assurance for the banks of items.

By mapping out each construct (scale) and the cognitive complexity of task models needed to inform progress along them, we also build strong validity arguments into the scale design itself. By creating and then calibrating task model families rather than the individual items, we solve some of the item production complications needed to sustain on-demand testing in a formative assessment context. Well-designed task model families can further bypass pilot testing and the need to calibrate individual items.

The fact that we have to contend with multiple scales and MIRT modeling is not a complication under Assessment Engineering. We merely construct each scale from the ground up and create a synchronized psychometric and test development infrastructure that emphasizes scalable production of items and test forms with intentional, ongoing quality control. Over time, design modifications that eliminate all but minor degrees of variation in the statistical operating characteristics within task-model families help ensure highly robust and formatively useful scales.
