In the beginning, there was an item…

Educational Measurement: Issues and Practice, 43(4), 40–45 (2024). https://doi.org/10.1111/emip.12647
Deborah J. Harris, Catherine J. Welch, Stephen B. Dunbar
{"title":"In the beginning, there was an item…","authors":"Deborah J. Harris,&nbsp;Catherine J. Welch,&nbsp;Stephen B. Dunbar","doi":"10.1111/emip.12647","DOIUrl":null,"url":null,"abstract":"<p>As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions, about students’ educational knowledge and future success, judge how successful educational programs are, determine what to teach tomorrow, and so on. It is good to remind ourselves that the basis for all our analyses, from simple means to complex multilevel, multidimensional modeling, interpretations of those analyses, and decisions we make based on the analyses are at the core based on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, etc., we need to remember it all starts with the items we collect information on. If we get those wrong, then the results of subsequent analyses are unlikely to provide the information we are seeking.</p><p>It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions are happening online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access is also true for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items, and while the researcher may know a test taker was administered, say, item #SK-65243-0273A and what the response was, they do not know what the text of the item actually was, which can make it challenging to interpret analysis results at times.</p><p>From having a test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models/artificial intelligence produce items, item development has morphed over the past and present, and will continue to morph into the future. Item tryouts for pretesting the quality and functioning of an item, including gathering data for generating item statistics to aid in forms construction and in some instances scoring, now attempt to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types, and collecting more data, such as latencies, click streams, and other process data on student responses to those items.</p><p>Sometimes we are so enamored of what we can do with the data, the analyses seem distant from the actual experience: a test taker responding to an item. And this makes it challenging at times to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.</p><p>The <i>Standards for Educational and Psychological Testing</i> (<i>Standards</i>; American Educational Research Association [AERA], the American Psychological Association [APA], and the National Council on Measurement in Education [NCME], <span>1966, 1974</span>, <span>1985, 1999</span>, <span>2014</span>) have been the guiding principles for test developers and educational measurement specialists for decades. The <i>Standards</i> have evolved over time. 
They require consensus from three major organizations: APA, AERA, and NCME, which incorporate considerations of multiple viewpoints. The earliest editions seem to somewhat neglect treatment of individual items, concentrating instead on the collection of items or test forms. In keeping with <i>The Past, Present, and Future of Educational Measurement</i> theme, we use the five editions of the <i>Standards</i> to examine how the focus on items has morphed over the years, and to look ahead to the future edition of the <i>Standards</i> currently under development, and how they conceptualize issues related to ‘items’.</p><p>Our intent is to focus attention clearly on items, in all their formats, as the basis for our measurement decisions. The items test takers respond to are at the center of the data we collect, analyze, and interpret. And yet, at times the items seem very far from our focus. Graduate students in educational measurement typically have a general class on measurement early in their training, which covers basic foundational concepts such as reliability and validity, and typically, some treatment of item writing is often included, usually dealing with multiple choice items and good and bad item writing tips. Constructed response items and rubrics may also be covered. However, as students enter more advanced courses, it seems item statistics, <i>p</i>-values, point-biserials, IRT parameter estimates, are where the focus is. The actual text of items may not even be presented as “good” statistical properties, item bias and item fit statistics are discussed, and items in an assessment are retained or discarded based solely on statistical properties, ignoring the item text. There may be even more distance from the actual items as students generate data for simulation studies and get used to dealing with zeros and ones, removed from the actual items test takers respond to.</p><p>The <i>Standards</i> have evolved to reflect and address changes in the field of testing. With respect to the item development process, an expansion of the issues and content can be seen in the evolution along with a restructuring and repositioning of the areas of importance. This paper explores the evolution within the context of item development.</p><p>In the 1966 edition, topics pertaining to validity and reliability were emphasized and considered essential. Within the chapter devoted to validity, issues pertaining to content validity included standards that addressed the representativeness of items in constructing a test, the role of experts in the selection and review of items, and the match between items and test specifications were covered. For large-scale achievement tests, item writing was assumed to be completed by subject-matter specialists that “devise” and select items judged to cover the topics and processes related to the assessment being produced. Agreement among independent judgments when individual items are being selected by experts was also emphasized (Standard C3.11). However, details defining a process to follow in the item writing process were not emphasized. Similar issues pertaining to item writing were included in the 1974 edition. However, this edition also emphasized the importance of test fairness, bias and sensitivity (see Standard E12.1.2).</p><p>With respect to item writers, the documentation of the qualifications of the item writers and editors were described as desirable for achievement tests. 
In addition, the concept of construct validity and the alignment of test content with theoretical constructs are addressed. Standards address the practice of using experts to judge the appropriateness of items as they relate to the “universe of tasks” represented by the test. The documentation of the qualifications of the experts and the agreement of the experts in selecting items can be viewed as a precursor to many item writing and alignment activities that are in place today.</p><p>A new organizational structure was introduced in the 1985 document, grouping the technical standards together under “Technical Standards for Test Construction and Evaluation” and devoting a chapter to test development and revisions. The chapter included 25 standards that addressed test specifications, test development, fairness, item analysis and design associated with intended use and categorized each as primary, secondary, or conditional. The 1999 edition retained the 1985 structure by grouping technical measurement issues together under “Test Construction, Evaluation and Documentation” and dedicating a chapter to test development and revisions. The 27 standards included in 1999 paralleled the earlier version, but also included a more detailed introduction on test development. The introduction explored different item formats (selected-response, extended-response, portfolios, and performance items) and discussed implications for design. The importance of federal education law related to assessment and accountability was also discussed in greater detail in this edition.</p><p>To reflect changes in education between 1999 and 2014 such as the passage of the No Child Left Behind Act in 2001, the most recent version of the <i>Standards</i> include a stronger focus on accountability issues associated with educational testing. This edition highlighted fairness as one of three “Foundations” chapters, elevating it to the same level as validity and reliability. The 2014 chapter on test design and development was moved to one of six chapters under “Operations.” The introduction to this chapter was expanded and specifically addressed item development and review through the articulation of content quality, clarity, construct-irrelevance, sensitivity, and appropriateness.</p><p>The 2014 <i>Standards</i> continue to be used by testing organizations to identify and shape the processes and procedures followed to develop items. Current procedures used by testing organizations operationalize a process of item writing that begins with the design of the assessment and continues through operational testing. Standard 3.1 states that “<i>those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population</i>.” In this way, the 2014 <i>Standards</i> incorporate the basic concepts of evidence-centered design (ECD) that have been the basis for sound development practices for decades.</p><p>The 2014 <i>Standards</i> emphasize that item development is critical for providing quality and consistency for K–12 large-scale assessments. Most states and local education agencies that are responsible for large-scale assessment programs use the current <i>Standards</i> to structure the processes and provide the necessary documentation. 
Although there are variations on the process, most education agencies articulate procedures for item development, item writers, and review processes.</p><p>With respect to core concepts in the <i>Standards</i> over the years of their evolution, item quality broadly defined can be thought of as the foundation of any argument for validity, reliability, and fairness. In K–12 assessment, the human element has loomed large in terms of its contribution to and evaluation of item quality. Teachers and other subject-matter experts (SMEs) write and edit items, and they serve on panels that review items drafted for large-scale assessments for appropriateness, accessibility, sensitivity, and bias, among other attributes. Alignment studies engage a wide range of stakeholders in judgmental reviews of items and their consistency with established content standards and other substantive descriptors. As we look ahead in item development, it is important to recognize the continued critical role of human judgment in applying the <i>Standards</i> as well as the qualitative evidence it provides to the profession and the public.</p><p>There seems to be little doubt after the recent explosion of interest in artificial intelligence (AI) and large-language models (LLMs) in education that the future of item development will make every attempt to leverage AI to expand the supply of materials available for large-scale assessments in K–12 (Hao et al., <span>2024</span>). Generations of SMEs have been producing independently countless test items psychometricians hope to be essentially interchangeable in terms of alignment to content specifications and construct-relevant contributions to score variance. Proponents of AI are rightfully optimistic about its potential to contribute to item development generally and in K–12 applications specifically. Although this statement is truer of some content areas than others, AI has the potential to provide important efficiencies of scale.</p><p>The idea that computers would someday contribute to item development is not new (cf. Richards, <span>1967</span>). Proponents of Automated Item Generation (AIG) for years have advanced methods to identify distinctive features of items to a degree of specificity that algorithms and/or item templates could be developed to write items on the fly, and successful small-scale applications have been developed in cognition as well as achievement (Embretson &amp; Kingston, <span>2018</span>; Gierl &amp; Lai, 2015; but see Attali, <span>2018</span>). As we write this in mid-2024, however, few large-scale assessments, particularly those used in K–12 accountability programs, have implemented AIG on an operational basis. What has occurred at a lightning pace in the last several years is the infusion of AI into current thinking about AIG and item development. Major providers of assessments in multiple areas of application are producing white papers and establishing advisory panels to help manage the role of AI in test and item development (Association of Test Publishers, <span>2021</span>; Pearson VUE, <span>2024</span>; Smarter Balanced Assessment Consortium, <span>2024</span>). The AI infusion is likely to continue in the testing industry at a pace which the published literature in educational measurement will be challenged to keep up with.</p><p>AI, technology, federal and state laws, and the emphasis on fairness and equity continue to have a major influence of the development of items for large scale K–12 assessments. 
The <i>Standards</i> have guided best practices in the development and interpretations of assessment items since the first edition in 1966 and will continue to do so through future editions. What we, as test developers, educational researchers, and practitioners need to keep front of mind for the appropriateness of our interpretations of assessment data, whether through simple raw scores or complex multilevel modeling analyses, is that the content and design of the items are the basis for all score interpretations and all decisions made from those scores. Graduate students in our field need to be, in many cases, better trained in the art and science of item development. Researchers conducting analyses on which high level decisions will be made need to remain cognizant that a test taker responding to a particular set of items is the basis on which everything they are doing rests. And as we leverage cloning, AIG, AI, and technology to boost our item pools, we need to remember that the educational achievement we are trying to assess all starts with a test taker responding to an item.</p><p>Consistent with Russell et al. (<span>2019</span>), we believe that as the field continues to develop, it also risks splintering into different camps of test developers and psychometricians. To minimize this risk, we support the design and development of graduate programs that “embrace the full life cycle of instrument development, analyses, use, and refinement holds potential to develop rounded professionals who have a fuller appreciation of the challenges encountered during each stage of instrument development and use” (Russell et al., <span>2019</span>, p. 86). This idea is consistent with the principles of ECD and is reflected in the most recent editions of the <i>Standards</i>, however it is not clear that graduate training programs in the field embrace it to the extent needed to ensure that future item development maintains the level of rigor called for by either ECD or the <i>Standards</i>. Perhaps it is time to elevate deep understanding of item development into the pantheon of critical skills necessary for defining what it means to be a psychometrician to ensure every interpretation of analyses and results takes into consideration the items on which the data were collected.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"40-45"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12647","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Measurement-Issues and Practice","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/emip.12647","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}

Abstract

As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions: we judge students’ educational knowledge and future success, evaluate how successful educational programs are, determine what to teach tomorrow, and so on. It is good to remind ourselves that all of our analyses, from simple means to complex multilevel, multidimensional modeling, the interpretations of those analyses, and the decisions we make based on them rest, at their core, on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, and the like, we need to remember that it all starts with the items we collect information on. If we get those wrong, the results of subsequent analyses are unlikely to provide the information we are seeking.

It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions are happening online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access is also true for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items, and while the researcher may know a test taker was administered, say, item #SK-65243-0273A and what the response was, they do not know what the text of the item actually was, which can make it challenging to interpret analysis results at times.

From having a test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models/artificial intelligence produce items, item development has morphed through the past and present, and will continue to morph into the future. Item tryouts, which pretest the quality and functioning of an item and gather data for generating item statistics to aid in forms construction and, in some instances, scoring, are now giving way to attempts to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types and collecting more data, such as latencies, click streams, and other process data, on student responses to those items. A rough sketch of the prediction idea appears below.
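
To make the prediction idea concrete, the sketch below fits a simple least-squares model that predicts an item's p-value from item features rather than from pretest data. It is a minimal illustration under invented assumptions: the features (word count, cognitive level) and the numbers are hypothetical, not drawn from any assessment program or from the article.

```python
# A minimal, hypothetical sketch of predicting an item statistic from item
# features instead of pretest data, using ordinary least squares.
import numpy as np

# Hypothetical pretested items: columns = [word_count, cognitive_level],
# with their observed p-values (proportion correct).
X = np.array([[25, 1], [40, 2], [60, 3], [30, 1], [55, 2], [70, 3]], dtype=float)
y = np.array([0.85, 0.70, 0.55, 0.80, 0.62, 0.50])

# Add an intercept column and fit by least squares.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the p-value of a new, untried item from its features alone.
new_item = np.array([1.0, 45, 2])  # intercept, word_count, cognitive_level
print("predicted p-value:", float(new_item @ coef))
```

Operational approaches are of course far richer than a two-feature regression, but the logic is the same: item characteristics are inferred from properties of the item itself rather than from responses gathered in a tryout.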

Sometimes we are so enamored of what we can do with the data that the analyses seem distant from the actual experience: a test taker responding to an item. This can make it challenging at times to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.

The Standards for Educational and Psychological Testing (Standards; American Educational Research Association [AERA], the American Psychological Association [APA], and the National Council on Measurement in Education [NCME], 1966, 1974, 1985, 1999, 2014) have been the guiding principles for test developers and educational measurement specialists for decades. The Standards have evolved over time. They require consensus among three major organizations, AERA, APA, and NCME, a process that incorporates multiple viewpoints. The earliest editions somewhat neglect the treatment of individual items, concentrating instead on the collection of items or test forms. In keeping with The Past, Present, and Future of Educational Measurement theme, we use the five editions of the Standards to examine how the focus on items has morphed over the years, and to look ahead to the edition currently under development and how it conceptualizes issues related to ‘items’.

Our intent is to focus attention clearly on items, in all their formats, as the basis for our measurement decisions. The items test takers respond to are at the center of the data we collect, analyze, and interpret. And yet, at times the items seem very far from our focus. Graduate students in educational measurement typically take a general class on measurement early in their training that covers foundational concepts such as reliability and validity; some treatment of item writing is often included, usually dealing with multiple-choice items and tips on good and bad item writing. Constructed-response items and rubrics may also be covered. However, as students enter more advanced courses, the focus seems to shift to item statistics: p-values, point-biserials, and IRT parameter estimates. The actual text of items may not even be presented: “good” statistical properties, item bias, and item fit statistics are discussed, and items in an assessment are retained or discarded based solely on statistical properties, ignoring the item text. There may be even more distance from the actual items as students generate data for simulation studies and grow used to dealing with zeros and ones, removed from the actual items test takers respond to.
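
To make the statistics just mentioned concrete, the short sketch below computes a p-value (proportion correct) and a corrected point-biserial for each item from a matrix of scored responses. It is a minimal illustration: the tiny response matrix and the use of NumPy are our own assumptions, not data or code from any assessment program.

```python
# A minimal sketch of two classical item statistics: the p-value (proportion
# correct) and the corrected point-biserial (correlation of an item's 0/1
# scores with the rest-of-test total).
import numpy as np

# Illustrative scored responses: rows = test takers, columns = items, 1 = correct.
responses = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

p_values = responses.mean(axis=0)  # item difficulty as proportion correct

point_biserials = []
for j in range(responses.shape[1]):
    # Correlate the item with the total score of the *remaining* items,
    # so the item does not correlate with itself.
    rest_total = responses.sum(axis=1) - responses[:, j]
    point_biserials.append(np.corrcoef(responses[:, j], rest_total)[0, 1])

for j, (p, rpb) in enumerate(zip(p_values, point_biserials), start=1):
    print(f"Item {j}: p = {p:.2f}, corrected point-biserial = {rpb:.2f}")
```

Even in a toy example like this, the zeros and ones carry meaning only because each column corresponds to an actual item a test taker saw.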

The Standards have evolved to reflect and address changes in the field of testing. With respect to the item development process, an expansion of the issues and content can be seen in the evolution along with a restructuring and repositioning of the areas of importance. This paper explores the evolution within the context of item development.

In the 1966 edition, topics pertaining to validity and reliability were emphasized and considered essential. Within the chapter devoted to validity, the treatment of content validity included standards that addressed the representativeness of items in constructing a test, the role of experts in the selection and review of items, and the match between items and test specifications. For large-scale achievement tests, item writing was assumed to be completed by subject-matter specialists who “devise” and select items judged to cover the topics and processes related to the assessment being produced. Agreement among independent judgments when individual items are being selected by experts was also emphasized (Standard C3.11). However, details defining a process to follow in writing items were not emphasized. Similar issues pertaining to item writing were included in the 1974 edition, which also emphasized the importance of test fairness, bias, and sensitivity (see Standard E12.1.2).

With respect to item writers, documentation of the qualifications of the item writers and editors was described as desirable for achievement tests. In addition, the concept of construct validity and the alignment of test content with theoretical constructs were addressed. Standards addressed the practice of using experts to judge the appropriateness of items as they relate to the “universe of tasks” represented by the test. The documentation of the qualifications of the experts and the agreement among experts in selecting items can be viewed as a precursor to many of the item writing and alignment activities that are in place today.

A new organizational structure was introduced in the 1985 document, grouping the technical standards together under “Technical Standards for Test Construction and Evaluation” and devoting a chapter to test development and revision. The chapter included 25 standards that addressed test specifications, test development, fairness, item analysis, and design associated with intended use, and categorized each as primary, secondary, or conditional. The 1999 edition retained the 1985 structure by grouping technical measurement issues together under “Test Construction, Evaluation and Documentation” and dedicating a chapter to test development and revision. The 27 standards included in 1999 paralleled the earlier version but also included a more detailed introduction to test development. The introduction explored different item formats (selected-response, extended-response, portfolios, and performance items) and discussed implications for design. The importance of federal education law related to assessment and accountability was also discussed in greater detail in this edition.

To reflect changes in education between 1999 and 2014, such as the passage of the No Child Left Behind Act in 2001, the most recent version of the Standards includes a stronger focus on accountability issues associated with educational testing. This edition highlighted fairness as one of three “Foundations” chapters, elevating it to the same level as validity and reliability. The 2014 chapter on test design and development became one of six chapters under “Operations.” The introduction to this chapter was expanded and specifically addressed item development and review through the articulation of content quality, clarity, construct irrelevance, sensitivity, and appropriateness.

The 2014 Standards continue to be used by testing organizations to identify and shape the processes and procedures followed to develop items. Current procedures used by testing organizations operationalize a process of item writing that begins with the design of the assessment and continues through operational testing. Standard 3.1 states that “those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant subgroups in the intended population.” In this way, the 2014 Standards incorporate the basic concepts of evidence-centered design (ECD) that have been the basis for sound development practices for decades.

The 2014 Standards emphasize that item development is critical for providing quality and consistency for K–12 large-scale assessments. Most states and local education agencies that are responsible for large-scale assessment programs use the current Standards to structure the processes and provide the necessary documentation. Although there are variations on the process, most education agencies articulate procedures for item development, item writers, and review processes.

With respect to core concepts in the Standards over the years of their evolution, item quality broadly defined can be thought of as the foundation of any argument for validity, reliability, and fairness. In K–12 assessment, the human element has loomed large in terms of its contribution to and evaluation of item quality. Teachers and other subject-matter experts (SMEs) write and edit items, and they serve on panels that review items drafted for large-scale assessments for appropriateness, accessibility, sensitivity, and bias, among other attributes. Alignment studies engage a wide range of stakeholders in judgmental reviews of items and their consistency with established content standards and other substantive descriptors. As we look ahead in item development, it is important to recognize the continued critical role of human judgment in applying the Standards as well as the qualitative evidence it provides to the profession and the public.

There seems little doubt, after the recent explosion of interest in artificial intelligence (AI) and large language models (LLMs) in education, that the future of item development will make every attempt to leverage AI to expand the supply of materials available for large-scale assessments in K–12 (Hao et al., 2024). Generations of SMEs have independently produced countless test items that psychometricians hope are essentially interchangeable in terms of alignment to content specifications and construct-relevant contributions to score variance. Proponents of AI are rightfully optimistic about its potential to contribute to item development generally and to K–12 applications specifically. Although this is truer of some content areas than others, AI has the potential to provide important efficiencies of scale.

The idea that computers would someday contribute to item development is not new (cf. Richards, 1967). For years, proponents of Automated Item Generation (AIG) have advanced methods to identify distinctive features of items with enough specificity that algorithms and/or item templates can be developed to write items on the fly, and successful small-scale applications have been developed in cognition as well as achievement (Embretson & Kingston, 2018; Gierl & Lai, 2015; but see Attali, 2018); a simplified template sketch follows below. As we write this in mid-2024, however, few large-scale assessments, particularly those used in K–12 accountability programs, have implemented AIG on an operational basis. What has occurred at a lightning pace in the last several years is the infusion of AI into current thinking about AIG and item development. Major providers of assessments in multiple areas of application are producing white papers and establishing advisory panels to help manage the role of AI in test and item development (Association of Test Publishers, 2021; Pearson VUE, 2024; Smarter Balanced Assessment Consortium, 2024). The AI infusion is likely to continue in the testing industry at a pace that the published literature in educational measurement will be challenged to keep up with.
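
As a simplified illustration of the item-template idea behind AIG, the sketch below fills slots in a hypothetical arithmetic item model and derives distractors from common errors. The template, slot values, and distractor rules are invented for illustration only and do not represent any operational generator cited above.

```python
# A toy sketch of template-based automated item generation: an item model
# with slots that an algorithm fills to produce many parallel items on the
# fly. The template, names, and distractor rules here are hypothetical.
import random

TEMPLATE = ("{name} buys {n} notebooks that cost ${price} each. "
            "How much does {name} spend in total?")
NAMES = ["Ana", "Ben", "Chen", "Dara"]

def generate_item(rng: random.Random) -> dict:
    name = rng.choice(NAMES)
    n = rng.randint(2, 9)
    price = rng.randint(2, 5)
    key = n * price
    # Distractors modeled on common errors (e.g., adding instead of multiplying).
    distractors = {n + price, key + price, key - n}
    distractors.discard(key)
    options = sorted(distractors | {key})
    return {"stem": TEMPLATE.format(name=name, n=n, price=price),
            "options": options, "key": key}

rng = random.Random(2024)
for _ in range(3):
    item = generate_item(rng)
    print(item["stem"], item["options"], "key:", item["key"])
```

The hard problem the literature grapples with is not producing such variants but ensuring that the generated items are truly interchangeable in content coverage and statistical behavior.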

AI, technology, federal and state laws, and the emphasis on fairness and equity continue to have a major influence on the development of items for large-scale K–12 assessments. The Standards have guided best practices in the development and interpretation of assessment items since the first edition in 1966 and will continue to do so through future editions. What we, as test developers, educational researchers, and practitioners, need to keep front of mind, whether we interpret assessment data through simple raw scores or complex multilevel modeling analyses, is that the content and design of the items are the basis for all score interpretations and all decisions made from those scores. Graduate students in our field need to be, in many cases, better trained in the art and science of item development. Researchers conducting analyses on which high-level decisions will be made need to remain cognizant that a test taker responding to a particular set of items is the basis on which everything they are doing rests. And as we leverage cloning, AIG, AI, and technology to boost our item pools, we need to remember that the educational achievement we are trying to assess all starts with a test taker responding to an item.

Consistent with Russell et al. (2019), we believe that as the field continues to develop, it also risks splintering into different camps of test developers and psychometricians. To minimize this risk, we support the design and development of graduate programs that “embrace the full life cycle of instrument development, analyses, use, and refinement,” which “holds potential to develop rounded professionals who have a fuller appreciation of the challenges encountered during each stage of instrument development and use” (Russell et al., 2019, p. 86). This idea is consistent with the principles of ECD and is reflected in the most recent editions of the Standards; however, it is not clear that graduate training programs in the field embrace it to the extent needed to ensure that future item development maintains the level of rigor called for by either ECD or the Standards. Perhaps it is time to elevate a deep understanding of item development into the pantheon of critical skills that define what it means to be a psychometrician, so that every interpretation of analyses and results takes into consideration the items on which the data were collected.
