Title: Validity: Conceptualizations for anatomy and health professions educators
Author: Victoria A. Roach, PhD
Journal: Anatomical Sciences Education, 18(8), 751-756
Publication date: 2025-03-25
DOI: 10.1002/ase.70016
Abstract
Validity, a cornerstone of assessment theory and practice, holds particular importance in the context of health professions education. In disciplines such as anatomy education, where assessments often bear significant consequences for learners and their future roles in healthcare, the integrity and applicability of testing instruments are paramount. Validity evidence is essential for three primary reasons: (1) it provides support for the intended purpose of the assessment, (2) it conveys to stakeholders that the results of the assessment are both credible and meaningful, and (3) it guides test development and refinement towards the goal of ensuring fairness, improved decision-making, and protection against the misuse of instruments. Without validity evidence, the results of an assessment may be misleading, misinterpreted, or improperly applied, potentially leading to incorrect conclusions or decisions about a student's progress.
This editorial traces the evolution of the concept of validity, reviews contemporary perspectives on it, and discusses its critical role in anatomical and health professions education (HPE) assessment. By examining validity's historical development, contemporary perspectives, and practical implications, educators can better understand how to design, analyze, and refine assessment and evaluation tools that meet the rigorous demands of HPE.
The concept of validity has evolved significantly in the nearly eight decades since its emergence. The earliest recorded conceptualizations of validity surfaced in the late 1940s1 and early 1950s,2 in a series of technical reports authored by the American Psychological Association (APA).2, 3 In the first technical report, four types of validity were identified: predictive, content, congruent, and status.2 This report was quickly followed by a second technical report, in which four different types of validity were described: construct, content, predictive, and concurrent3 (For definitions of these, and select other early types or aspects of validity that have been proposed, see Table 1).
This second technical report would lay the groundwork for what would become known as “The Standards,” a set of recommendations prepared and revised by the Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the APA, and the National Council on Measurement in Education (NCME), as guidelines for the evaluation, development, and use of testing instruments.4
The first edition of the Standards, published as “The Standards for Educational and Psychological Tests and Manuals”,4 introduced “criterion-related” validity, encompassing predictive and concurrent validity as subcomponents, leading to a three-part validity structure.4 This three-part validity structure, consisting of construct, criterion, and content validity, would come to be known as the Validity Trinity. The Validity Trinity, or “Three C's” of validity, represents the “classical” view of validity,5 which was held widely from 1966 through the late 1980s6 (For a timeline of the evolution of our societal understanding of validity, see Figure 1).
The Validity Trinity was challenged in 1989 by Samuel Messick, who proposed that the Validity Trinity was fragmented and incomplete, as it failed to take into account both the evidence of the value implications of test scores as a basis for action and the social consequences of score use.7, 8 Consequently, Messick presented the unified validity framework, which redefined the concept of validity in testing.9 This framework emphasized that validity is not merely a collection of separate and substitutable types but a unitary construct centered around the concept of construct validity.7 Messick emphasized that construct validity should encompass all forms of validity evidence, making it the central focus of measurement, which may occur quantitatively and/or qualitatively in a variety of forms (e.g., test scores, performance/skill assessments, surveys, observations, interviews, and content analyses).10 To address the complexity of validity as a unified concept, Messick proposed six interdependent and complementary sources of validity evidence.11 The six sources were later condensed to five when the unified theory of validity replaced the Validity Trinity in the 1999 revision to The Standards.12
Specifically, The Standards outline the following sources of evidence to support an instrument's validity: evidence that is based on test content, response processes, internal structure, relations to other variables, and finally, evidence based on the consequences of testing13 (See Table 2 for descriptions of each source of validity evidence outlined in The Standards; for examples and elaboration, please also see Beckman et al.14).
Taken together, these sources of evidence function as general validity criteria for all educational and psychological measurements, including performance assessments.7-9, 13, 15-17 Thus, in this conceptualization, validity may be operationally defined as the degree to which an instrument accurately measures what it is intended to measure, meaning the results can be confidently interpreted as reflecting the intended skills, knowledge, or attitudes being assessed.13
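The Standards do not prescribe particular statistics for these sources of evidence, but two of them are routinely examined quantitatively: internal structure (e.g., with an internal-consistency estimate such as Cronbach's alpha) and relations to other variables (e.g., correlating total scores with an external criterion). As a minimal illustrative sketch (the function names and data shapes are hypothetical, not from the Standards):

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal-consistency estimate for a (respondents x items) score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # sample variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of respondents' totals
    return float((k / (k - 1)) * (1 - item_vars.sum() / total_var))

def criterion_correlation(total_scores: np.ndarray, criterion: np.ndarray) -> float:
    """'Relations to other variables' evidence: Pearson r between test totals
    and an external criterion measure (e.g., a later performance rating)."""
    return float(np.corrcoef(total_scores, criterion)[0, 1])
```

Neither statistic by itself "validates" an instrument; under the unified view, each is one strand of evidence that must be weighed alongside evidence on content, response processes, and consequences.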
In HPE, validity is often invoked to assert the quality of an assessment instrument or tool and justify its use in high-stakes decisions, such as admissions and fitness for practice. This reliance on validity, sometimes referred to as a “god term”,18, 19 is used rhetorically to convey adherence to high standards.19 However, there is notable inconsistency in how validity is conceptualized and applied, as highlighted by recent reviews.20, 21 While this variability is often attributed to a lack of understanding of modern validity theories, it may also reflect the diverse conceptualizations of validity that serve different purposes in HPE, a field influenced by a range of disciplines, including educational psychology, sociology, and measurement science.19
In 2017, St-Onge et al. identified and articulated three distinct yet coexisting conceptualizations of validity within the HPE literature: validity as a test characteristic, validity as an argument-based evidentiary chain, and validity as a social imperative.19
Validity as a Test Characteristic considers validity an intrinsic quality of an assessment instrument or tool. From this perspective, an instrument that has been validated in one context is assumed to maintain that validity universally across similar contexts. This approach offers practical utility by suggesting that ‘validated’ tools provide a “gold seal” of instrument quality, which can be particularly appealing for those seeking quick, reliable assessment instruments and who have no desire to frequently collect new data to support an instrument's use in different contexts.19
Those adopting this conceptualization view validity as an inherent property of the tool itself, independent of the specific content or context in which it is used. One adopting this conceptualization of validity may seek to answer the question, “Does this instrument or tool accurately measure what it claims to measure?” However, individuals adopting this conceptualization should be cautious, as the treatment of validity as an unchanging characteristic of the instrument can create a “false sense of security” when applied in different contexts or populations.19 This is particularly concerning when instruments are utilized in different contexts and with different intentions beyond the original area of study. For instance, Gould (1996) highlights the risks of misusing “validated” tests in The Mismeasure of Man, showing how IQ tests, intended to measure intelligence with a single score, were applied in inappropriate contexts—such as immigration—to wrongly categorize entire ethnic groups as less intelligent.22
Validity as an Argument-Based Evidentiary Chain emphasizes a more dynamic process, where evidence must continually support the interpretations and uses of assessment instrument results. In this view, validation is ongoing and context-dependent, focusing on how an assessment instrument is used rather than its inherent qualities. This discourse aligns well with the scientific method, requiring an accumulation of evidence for each new assessment scenario, thus fostering rigorous examination and flexibility in test interpretation.19 Kane's framework supports this notion of validity as an Argument-Based Evidentiary Chain, which emphasizes that validity is established through a structured argument, where evidence is gathered to support specific claims about test score interpretations and uses.23 His framework outlines four stages—scoring, generalization, extrapolation, and interpretation/use—each requiring distinct evidence to ensure the assessment instrument's results are reliable, representative, and appropriately applied. This model also aligns with Messick's unified theory by emphasizing that validity is not inherent in the test itself but is demonstrated through the evidence supporting the interpretation and use of scores.
In this conceptualization, validity is not tied to the tool but rather to interpreting its results, relying on continuous evidence collection and analysis. An individual adopting this conceptualization of validity may seek to answer the question: “Am I drawing valid inferences from these test scores?” or “Are my interpretations of these test scores valid?”
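To make the evidentiary chain concrete, Kane's four stages can be treated as a checklist against which collected evidence is logged and gaps are surfaced. A minimal sketch, assuming a hypothetical `ValidityArgument` class (the names and example evidence are illustrative, not terminology from Kane):

```python
from dataclasses import dataclass, field

# Kane's four inferences, each requiring its own supporting evidence.
STAGES = ("scoring", "generalization", "extrapolation", "interpretation/use")

@dataclass
class ValidityArgument:
    instrument: str
    # One evidence list per stage; all start empty.
    evidence: dict = field(default_factory=lambda: {s: [] for s in STAGES})

    def add_evidence(self, stage: str, item: str) -> None:
        if stage not in self.evidence:
            raise ValueError(f"unknown stage: {stage!r}")
        self.evidence[stage].append(item)

    def unsupported_stages(self) -> list:
        """Stages of the argument that still lack any supporting evidence."""
        return [s for s in STAGES if not self.evidence[s]]
```

For example, logging a rubric-calibration study under "scoring" for an anatomy practical exam would leave the remaining three stages flagged as unsupported, reflecting the view that validation is a continuing, context-specific process rather than a one-time stamp.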
Validity as a Social Imperative broadens the concept of validity beyond the test and its interpretations to include the societal and ethical impacts of assessment. This view stresses that assessments should serve to measure and support societal needs and ethical considerations, such as equity and the holistic preparation of healthcare providers. Validity, from this perspective, thus emphasizes the broader consequences of testing on learners and society, highlighting the importance of fair and responsible assessment practices.19 It expands the traditional technical perspective to include societal impact, recognizing that validity concerns not only accurate measurement but also ensuring that assessments contribute to a just and inclusive society. The notion of validity as a social imperative is supported by the Explanation-Focused View of Validity,24 which aims to embed fairness, equity, and ethical considerations into its framework. The Explanation-Focused View of Validity, like the Argument-Based Evidentiary Chain conceptualization, defines test validity as the process of understanding and explaining variations in test scores, but emphasizes the contextual, ecological, and social factors that influence performance. This approach extends beyond the psychometric properties of the instrument or tool to encompass broader concerns about fairness, equity, and the real-world implications of assessments.24
The ‘Validity as Social Imperative’ conceptualization emphasizes assessments' broader societal and individual impacts, highlighting the ethical responsibility of ensuring positive and equitable outcomes from assessment practices. An individual adopting this conceptualization of validity may seek to answer the question: “What are the consequences or impacts of this assessment for the individual learner and for society?” While the “Validity as an Argument-Based Evidentiary Chain” framework stands in direct opposition to the “Validity as a Test Characteristic” framework, it stands as a strong complement to the “Validity as a Social Imperative” framework. Both frameworks may be held together, and evidence of both conceptualizations may be collected in parallel. This tandem approach may be most valuable when addressing concerns of fairness, equity, and consequences resulting from test scores. This is especially true in high-stakes situations, where the consequences of test outcomes are severe, and patient safety may be impacted.
Victoria A. Roach: Conceptualization; writing – review and editing; writing – original draft; resources.
No grants or funding supported the production of this manuscript.
Journal introduction:
Anatomical Sciences Education, affiliated with the American Association for Anatomy, serves as an international platform for sharing ideas, innovations, and research related to education in anatomical sciences. Covering gross anatomy, embryology, histology, and neurosciences, the journal addresses education at various levels, including undergraduate, graduate, post-graduate, allied health, medical (both allopathic and osteopathic), and dental. It fosters collaboration and discussion in the field of anatomical sciences education.