{"title":"学生评价的效度与信度:作为有效信度统一概念的认知案例","authors":"Kapil Gupta","doi":"10.4103/ijabmr.ijabmr_382_23","DOIUrl":null,"url":null,"abstract":"Students’ assessment, the major component of learning cycle, and curriculum serve many functions.[1-4] Assessment suggests areas of improvement during the training; the selection of students based on performance helps in the evaluation of the program and also has predictive utility. The assessment can be of learning – summative assessment, for learning - formative assessment, and can be without any external supervision – internal assessment. The importance and necessity of assessment in the learning cycle of the students can be gauged from the wordings of various proverbs in use, such as – ”assessment derives learning,” “assessment leads to learning,” and “assessment are the tail that wags the curriculum dog.” The students do modify their learning as per assessment. To make a precise decision about student’s learning and competency, the assessment must have both measurable and nonmeasurable components.[5] Van der Vleuten and Schuwirth defined assessment “as any formal or purported action to obtain information about the competence and performance of a student.”[6] Further, the assessment can be either criterion-referenced-comparing the competence of students against some fixed criteria or norm-referenced-comparing the performance of students with each other. Besides an aid to learning by virtue of having a provision of feedback and thus improve learning, assessment has reverse side too-improperly designed assessment can disfigure the learning. Therefore, any assessment should possess certain qualities or attributes. Traditional Concept Two important attributes defining students’ assessment are – reliability and validity. Conventionally, reliability of an assessment tool has commonly been referred to as “reproducibility” or “getting the same scores/marks under same conditions” or “precision of the measurement” or “consistency with which a test measures what it is supposed to assess.”[7] Reliability is measurable. As per classical test theory, the alpha coefficient (AC) is a range from 0 (no reliability) to 1 (perfect reliability); so if the test has an AC of 0.8%, it means it has a reliability of 80%, while measurement error is 20%.[8] The major factor affecting reliability is content or domain specificity. How an assessment can be reliable if it is based on a limited sampling of content or large content has been included in a single sample or if it is based on a single test? Moreover, a score that is derived from solving one problem cannot be interpolated for the second one. For example, assessment scores that are based on a single long case or viva for a single patient sample cannot produce reliable scores for another problem. If at the end of any professional year, subject knowledge is assessed by single multiple-choice questions (MCQs) based test of 10 items, can it measure students’ knowledge for the whole subject? Such assessments can be held valid but not reliable. Therefore, for any assessment test to be reliable, it is important to have a representation of the entire content as well as adequate sampling. Further, the reliability can also be increased by increasing the testing time, separating the whole content into multiple tests rather than a single test, and selecting battery of tests to access the same competency. 
Many studies have observed that almost the same reliability scores can be achieved with many assessment tools/methods if we increase the testing time and appropriate sampling is done.[9-15] Validity, another important characteristic of good assessment, is usually defined as measuring what it intends to measure. Validity is a unitary concept, and evidence can be drawn from many aspects such as content and construct-related and empirical evidence; therefore, the validity of assessment cannot be represented by a single coefficient like that of reliability. For example, if the performance of urinary blood proteins is part of 1st-year undergraduate medical training and these are not assessed in skill examination, the content-related validity is at stake. The construct validity consists of many attributes and mainly focuses on the problem-solving abilities of students based on subject knowledge, skills, data gathering and analysis, interpretation, communication, ethics, and many more. Thus, traditional concept was restricted to validity and reliability of the assessment tools and considered both as separate and unrelated entities [Figure 1].Figure 1: Traditional concept of validity and reliabilityContemporary View Of late, we have moved on from the historical concepts of validity and reliability of the assessment tools. Now, for educational purposes, we are not interested in the reliability of an assessment tool. Now, the more important aspect is how we use the tool to make the results reliable. Hence, as per the contemporary view, what we are really interested in is – ”Rely-ability” on our assessment results – How much we can trust our assessment results to make final judgments about students. Similarly, the contemporary concept of validity focuses on the interpretation that we make out of assessment data and not on the validity of assessment tools. So it is often said – No assessment tool/method is inherently invalid, more important is - what inference we draw from the assessment made using that tool. For example, the MCQs used will measure factual knowledge if it is designed to check factual knowledge; however, if some case-based scenario or some management plan for any disease is in-built into such MCQs, it will assess the problem-solving abilities of the students. Similarly, results will not be valid if in theory examination, steps for elicitation of knee-jerk reflex are asked, and student is certified to have skills for performing knee-jerk reflex. Hence, what is measured by the particular tool does not depend upon the tool, but what we put into the tool or what interpretation is drawn from results using that tool. The measurement of reliability of any assessment by any statistical methods should always be deduced, keeping the validity of the assessment in mind. The contemporary concept of validity considers validity as a unitary concept deduced from various empirical evidence including content, criterion, construct, and reliability evidence.[16] The reliability values have no meaning with poor validity. 
Thus, as per the contemporary view, the validity of an assessment is drawn from various evidence, including reliability evidence, and as such validity and reliability are no more considered separate entities, working in isolation [Figure 2].Figure 2: Contemporary vies of validity and reliabilityAlthough the contemporary view established a relationship between the reliability and validity of the assessment to the extent that we started considering reliability evidence necessary for drawing conclusions about validity, still the reverse is not true. Concept of Utility of Assessment Based on these facts, Vleuten deduced a concept of the utility of an assessment as an estimated product of reliability, validity, feasibility, acceptability, and educational impact,[17] while Vleuten and Schuwirth in 2005 further suggested a modified conceptual model to calculate the utility of assessment which was multiplicative model of different attributes by their differential weightage, as Utility = R × V × E × A × C, where R = Reliability, V = Validity, EI = Educational impact, A = Acceptability, C = Cost effectiveness.[18] This model also states that perfect assessment is not possible and deficiency in one attribute can be compensated if the weightage of another attribute is high; depending on the context and purpose of assessment. For example, in high-stake examinations, the assessment with high reliability will be of more value, while for any in-class test of multiple occurrences, the educational impact will be a more considerable criterion. The multiplicative nature of this model also ensures that if one variable is 0, the overall utility of the assessment automatically becomes 0. Similarly, if any one of the variables is in negative terms - thus promoting unsound learning habits - the utility of assessment will also be in negative. Valid reliability: A Theoretical Concept As stated above, for years, we have always treated the validity and reliability of assessments as separate measures. With contemporary views and utility models, an interrelationship has been established; but still, we consider validity and reliability as independent existing entities. However, do consider yourself, if any measure is consistently measuring a wrong construct, it can be said to be reliable while validity is at stake; but in practice, if any measure is not measuring what it is intended to measure, can we rely on its measurements? Not at all! Although apparently, reliability is there because of validity issues, we cannot rely on the assessment results to make valid and reliable inferences. Similarly, if any assessment is measuring the right construct, but not in a consistent manner, it is said to be valid, not reliable, but that occasional, accurate measurement of construct is of no use to make valid and reliable inferences. Validity and reliability go hand-in-hand, as demonstrated by a quantifiable link. It has been documented that the square root of reliability is almost equivalent to maximum attainable validity. For example, if the reliability coefficient for any test is 0.79, the validity coefficient cannot be larger than 0.88, which is itself square root of 0.79. Hence, it clearly implies that an assessment method is of use, only if it is valid as well as reliable.[19] As stated above, validity is now considered a unitary concept, and reliability evidence is an important part of validity; and as such, validity and reliability are interrelated. 
It has also been documented in the literature that validity and reliability experience trade-off – the stronger the bases for reliability, the weaker the bases for validity (vice-versa).[16] However, still, these two are considered separate concepts – reliability evidence is considered necessary for the validity of assessment, but what about validity contribution for the reliability of assessment? This becomes all the way more important when we are considering the validity of students’ assessment as a concept related to our inference from the assessment data and reliability of students’ assessment as a concept related to trust on the assessment inference, and both seems interrelated [Figure 3].Figure 3: Conceptual representation of validity and reliability for students’ assessmentLet us discuss it more with a simple example. If during a preexamination meeting, all the examiners have decided that they will not give more than 80% marks and <50% marks to any student in the practical examination, and the assessment has been carried out with all the valid methods, results are consistent too, can we still rely on such assessment results? As stated in another example quoted above – If after asking a student to write different steps of knee-jerk reflection during an article test, we are certifying him to have an ability to elicit knee-jerk – Are we making the right inference? Not at all! Can we lay confidence on such an inference or assessment – not at all! To broaden the notion – For students’ assessment, any assessment which is not valid cannot be reliable; and any assessment which is not reliable cannot be valid. This also implies that validity and reliability for students’ assessment should be considered a unified phenomenon, a unified concept, instead of discrete units. To make “accurate inference from any assessment with full confidence,” our assessment should be both reliable and valid. Moreover, it is high time; we should accept students’ assessments as having valid reliability or reliable validity!","PeriodicalId":13727,"journal":{"name":"International Journal of Applied and Basic Medical Research","volume":"60 1","pages":"0"},"PeriodicalIF":0.8000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Validity and Reliability of Students’ Assessment: Case for Recognition as a Unified Concept of Valid Reliability\",\"authors\":\"Kapil Gupta\",\"doi\":\"10.4103/ijabmr.ijabmr_382_23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Students’ assessment, the major component of learning cycle, and curriculum serve many functions.[1-4] Assessment suggests areas of improvement during the training; the selection of students based on performance helps in the evaluation of the program and also has predictive utility. The assessment can be of learning – summative assessment, for learning - formative assessment, and can be without any external supervision – internal assessment. The importance and necessity of assessment in the learning cycle of the students can be gauged from the wordings of various proverbs in use, such as – ”assessment derives learning,” “assessment leads to learning,” and “assessment are the tail that wags the curriculum dog.” The students do modify their learning as per assessment. 
To make a precise decision about student’s learning and competency, the assessment must have both measurable and nonmeasurable components.[5] Van der Vleuten and Schuwirth defined assessment “as any formal or purported action to obtain information about the competence and performance of a student.”[6] Further, the assessment can be either criterion-referenced-comparing the competence of students against some fixed criteria or norm-referenced-comparing the performance of students with each other. Besides an aid to learning by virtue of having a provision of feedback and thus improve learning, assessment has reverse side too-improperly designed assessment can disfigure the learning. Therefore, any assessment should possess certain qualities or attributes. Traditional Concept Two important attributes defining students’ assessment are – reliability and validity. Conventionally, reliability of an assessment tool has commonly been referred to as “reproducibility” or “getting the same scores/marks under same conditions” or “precision of the measurement” or “consistency with which a test measures what it is supposed to assess.”[7] Reliability is measurable. As per classical test theory, the alpha coefficient (AC) is a range from 0 (no reliability) to 1 (perfect reliability); so if the test has an AC of 0.8%, it means it has a reliability of 80%, while measurement error is 20%.[8] The major factor affecting reliability is content or domain specificity. How an assessment can be reliable if it is based on a limited sampling of content or large content has been included in a single sample or if it is based on a single test? Moreover, a score that is derived from solving one problem cannot be interpolated for the second one. For example, assessment scores that are based on a single long case or viva for a single patient sample cannot produce reliable scores for another problem. If at the end of any professional year, subject knowledge is assessed by single multiple-choice questions (MCQs) based test of 10 items, can it measure students’ knowledge for the whole subject? Such assessments can be held valid but not reliable. Therefore, for any assessment test to be reliable, it is important to have a representation of the entire content as well as adequate sampling. Further, the reliability can also be increased by increasing the testing time, separating the whole content into multiple tests rather than a single test, and selecting battery of tests to access the same competency. Many studies have observed that almost the same reliability scores can be achieved with many assessment tools/methods if we increase the testing time and appropriate sampling is done.[9-15] Validity, another important characteristic of good assessment, is usually defined as measuring what it intends to measure. Validity is a unitary concept, and evidence can be drawn from many aspects such as content and construct-related and empirical evidence; therefore, the validity of assessment cannot be represented by a single coefficient like that of reliability. For example, if the performance of urinary blood proteins is part of 1st-year undergraduate medical training and these are not assessed in skill examination, the content-related validity is at stake. The construct validity consists of many attributes and mainly focuses on the problem-solving abilities of students based on subject knowledge, skills, data gathering and analysis, interpretation, communication, ethics, and many more. 
Thus, traditional concept was restricted to validity and reliability of the assessment tools and considered both as separate and unrelated entities [Figure 1].Figure 1: Traditional concept of validity and reliabilityContemporary View Of late, we have moved on from the historical concepts of validity and reliability of the assessment tools. Now, for educational purposes, we are not interested in the reliability of an assessment tool. Now, the more important aspect is how we use the tool to make the results reliable. Hence, as per the contemporary view, what we are really interested in is – ”Rely-ability” on our assessment results – How much we can trust our assessment results to make final judgments about students. Similarly, the contemporary concept of validity focuses on the interpretation that we make out of assessment data and not on the validity of assessment tools. So it is often said – No assessment tool/method is inherently invalid, more important is - what inference we draw from the assessment made using that tool. For example, the MCQs used will measure factual knowledge if it is designed to check factual knowledge; however, if some case-based scenario or some management plan for any disease is in-built into such MCQs, it will assess the problem-solving abilities of the students. Similarly, results will not be valid if in theory examination, steps for elicitation of knee-jerk reflex are asked, and student is certified to have skills for performing knee-jerk reflex. Hence, what is measured by the particular tool does not depend upon the tool, but what we put into the tool or what interpretation is drawn from results using that tool. The measurement of reliability of any assessment by any statistical methods should always be deduced, keeping the validity of the assessment in mind. The contemporary concept of validity considers validity as a unitary concept deduced from various empirical evidence including content, criterion, construct, and reliability evidence.[16] The reliability values have no meaning with poor validity. Thus, as per the contemporary view, the validity of an assessment is drawn from various evidence, including reliability evidence, and as such validity and reliability are no more considered separate entities, working in isolation [Figure 2].Figure 2: Contemporary vies of validity and reliabilityAlthough the contemporary view established a relationship between the reliability and validity of the assessment to the extent that we started considering reliability evidence necessary for drawing conclusions about validity, still the reverse is not true. Concept of Utility of Assessment Based on these facts, Vleuten deduced a concept of the utility of an assessment as an estimated product of reliability, validity, feasibility, acceptability, and educational impact,[17] while Vleuten and Schuwirth in 2005 further suggested a modified conceptual model to calculate the utility of assessment which was multiplicative model of different attributes by their differential weightage, as Utility = R × V × E × A × C, where R = Reliability, V = Validity, EI = Educational impact, A = Acceptability, C = Cost effectiveness.[18] This model also states that perfect assessment is not possible and deficiency in one attribute can be compensated if the weightage of another attribute is high; depending on the context and purpose of assessment. 
For example, in high-stake examinations, the assessment with high reliability will be of more value, while for any in-class test of multiple occurrences, the educational impact will be a more considerable criterion. The multiplicative nature of this model also ensures that if one variable is 0, the overall utility of the assessment automatically becomes 0. Similarly, if any one of the variables is in negative terms - thus promoting unsound learning habits - the utility of assessment will also be in negative. Valid reliability: A Theoretical Concept As stated above, for years, we have always treated the validity and reliability of assessments as separate measures. With contemporary views and utility models, an interrelationship has been established; but still, we consider validity and reliability as independent existing entities. However, do consider yourself, if any measure is consistently measuring a wrong construct, it can be said to be reliable while validity is at stake; but in practice, if any measure is not measuring what it is intended to measure, can we rely on its measurements? Not at all! Although apparently, reliability is there because of validity issues, we cannot rely on the assessment results to make valid and reliable inferences. Similarly, if any assessment is measuring the right construct, but not in a consistent manner, it is said to be valid, not reliable, but that occasional, accurate measurement of construct is of no use to make valid and reliable inferences. Validity and reliability go hand-in-hand, as demonstrated by a quantifiable link. It has been documented that the square root of reliability is almost equivalent to maximum attainable validity. For example, if the reliability coefficient for any test is 0.79, the validity coefficient cannot be larger than 0.88, which is itself square root of 0.79. Hence, it clearly implies that an assessment method is of use, only if it is valid as well as reliable.[19] As stated above, validity is now considered a unitary concept, and reliability evidence is an important part of validity; and as such, validity and reliability are interrelated. It has also been documented in the literature that validity and reliability experience trade-off – the stronger the bases for reliability, the weaker the bases for validity (vice-versa).[16] However, still, these two are considered separate concepts – reliability evidence is considered necessary for the validity of assessment, but what about validity contribution for the reliability of assessment? This becomes all the way more important when we are considering the validity of students’ assessment as a concept related to our inference from the assessment data and reliability of students’ assessment as a concept related to trust on the assessment inference, and both seems interrelated [Figure 3].Figure 3: Conceptual representation of validity and reliability for students’ assessmentLet us discuss it more with a simple example. If during a preexamination meeting, all the examiners have decided that they will not give more than 80% marks and <50% marks to any student in the practical examination, and the assessment has been carried out with all the valid methods, results are consistent too, can we still rely on such assessment results? As stated in another example quoted above – If after asking a student to write different steps of knee-jerk reflection during an article test, we are certifying him to have an ability to elicit knee-jerk – Are we making the right inference? Not at all! 
Can we lay confidence on such an inference or assessment – not at all! To broaden the notion – For students’ assessment, any assessment which is not valid cannot be reliable; and any assessment which is not reliable cannot be valid. This also implies that validity and reliability for students’ assessment should be considered a unified phenomenon, a unified concept, instead of discrete units. To make “accurate inference from any assessment with full confidence,” our assessment should be both reliable and valid. Moreover, it is high time; we should accept students’ assessments as having valid reliability or reliable validity!\",\"PeriodicalId\":13727,\"journal\":{\"name\":\"International Journal of Applied and Basic Medical Research\",\"volume\":\"60 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.8000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Applied and Basic Medical Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4103/ijabmr.ijabmr_382_23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MEDICINE, GENERAL & INTERNAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Applied and Basic Medical Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4103/ijabmr.ijabmr_382_23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Students’ assessment, a major component of the learning cycle and the curriculum, serves many functions.[1-4] Assessment suggests areas for improvement during training; the selection of students based on performance helps in evaluating the program and also has predictive utility. Assessment can be of learning (summative assessment), for learning (formative assessment), or conducted without any external supervision (internal assessment). The importance and necessity of assessment in the students’ learning cycle can be gauged from common sayings such as “assessment drives learning,” “assessment leads to learning,” and “assessment is the tail that wags the curriculum dog.” Students do modify their learning according to how they are assessed. To make a precise decision about a student’s learning and competency, the assessment must have both measurable and nonmeasurable components.[5] Van der Vleuten and Schuwirth defined assessment “as any formal or purported action to obtain information about the competence and performance of a student.”[6] Further, assessment can be either criterion-referenced, comparing the competence of students against some fixed criteria, or norm-referenced, comparing the performance of students with each other. Besides aiding learning through the provision of feedback, assessment has a reverse side too: improperly designed assessment can distort learning. Therefore, any assessment should possess certain qualities or attributes.

Traditional Concept

Two important attributes defining students’ assessment are reliability and validity. Conventionally, the reliability of an assessment tool has been referred to as “reproducibility,” “getting the same scores/marks under the same conditions,” “precision of the measurement,” or “the consistency with which a test measures what it is supposed to assess.”[7] Reliability is measurable. As per classical test theory, the alpha coefficient (AC) ranges from 0 (no reliability) to 1 (perfect reliability); a test with an AC of 0.8 therefore has a reliability of 80%, with a measurement error of 20%.[8] The major factor affecting reliability is content or domain specificity. How can an assessment be reliable if it is based on a limited sampling of content, if a large amount of content is packed into a single sample, or if it rests on a single test? Moreover, a score derived from solving one problem cannot be extrapolated to a second one. For example, assessment scores based on a single long case or a viva on a single patient cannot produce reliable scores for another problem. If, at the end of a professional year, subject knowledge is assessed by a single multiple-choice question (MCQ)-based test of 10 items, can it measure students’ knowledge of the whole subject? Such assessments can be held valid but not reliable. Therefore, for any assessment to be reliable, it is important to have representation of the entire content as well as adequate sampling. Further, reliability can also be increased by increasing the testing time, separating the whole content into multiple tests rather than a single test, and selecting a battery of tests to assess the same competency.
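
As a concrete illustration of the reliability figures discussed above, the short Python sketch below computes Cronbach’s alpha (the alpha coefficient) from a small item-score matrix and then uses the Spearman-Brown prophecy formula to project how the coefficient rises as testing time (test length) is increased. The data, the four-item test, and the function names are illustrative assumptions and are not drawn from the article.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability if the test is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical 0/1 marks of six examinees on a four-item test.
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])

alpha = cronbach_alpha(scores)  # about 0.70 for this toy data
print(f"alpha = {alpha:.2f} -> measurement error ~ {1 - alpha:.0%}")
# Lengthening the (adequately sampled) test raises the projected coefficient.
for factor in (2, 3):
    print(f"x{factor} items -> projected alpha = {spearman_brown(alpha, factor):.2f}")
```

The Spearman-Brown projection holds only if the added items sample the same content domain with comparable quality, which is exactly the sampling caveat made above.
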
Many studies have observed that almost the same reliability scores can be achieved with many assessment tools/methods if the testing time is increased and appropriate sampling is done.[9-15] Validity, another important characteristic of good assessment, is usually defined as the extent to which a test measures what it intends to measure. Validity is a unitary concept, and evidence for it can be drawn from many sources, such as content-related, construct-related, and empirical evidence; therefore, the validity of an assessment cannot be represented by a single coefficient in the way reliability can. For example, if the estimation of blood and proteins in urine is part of first-year undergraduate medical training and this is not assessed in the skills examination, content-related validity is at stake. Construct validity comprises many attributes and focuses mainly on students’ problem-solving abilities, drawing on subject knowledge, skills, data gathering and analysis, interpretation, communication, ethics, and much more. Thus, the traditional concept was restricted to the validity and reliability of assessment tools and treated both as separate, unrelated entities [Figure 1].

Figure 1: Traditional concept of validity and reliability

Contemporary View

Of late, we have moved on from these historical concepts of the validity and reliability of assessment tools. For educational purposes, we are no longer interested in the reliability of an assessment tool per se; the more important question is how we use the tool so that the results are reliable. Hence, as per the contemporary view, what we are really interested in is “rely-ability” on our assessment results: how much we can trust those results when making final judgments about students. Similarly, the contemporary concept of validity focuses on the interpretation that we make of assessment data, not on the validity of assessment tools. It is therefore often said that no assessment tool or method is inherently invalid; what matters is the inference we draw from the assessment made using that tool. For example, MCQs will measure factual knowledge if they are designed to check factual knowledge; however, if a case-based scenario or a management plan for a disease is built into such MCQs, they will assess the problem-solving abilities of the students. Similarly, results will not be valid if, in a theory examination, students are asked to describe the steps for eliciting the knee-jerk reflex and are then certified as having the skill to perform it. Hence, what a particular tool measures depends not on the tool itself but on what we put into the tool and what interpretation is drawn from the results obtained using it. The reliability of any assessment, by whatever statistical method it is estimated, should always be interpreted keeping the validity of the assessment in mind. The contemporary concept treats validity as a unitary construct inferred from various sources of empirical evidence, including content, criterion, construct, and reliability evidence.[16] Reliability values have no meaning when validity is poor.
Thus, as per the contemporary view, the validity of an assessment is drawn from various forms of evidence, including reliability evidence, and validity and reliability are no longer considered separate entities working in isolation [Figure 2].

Figure 2: Contemporary view of validity and reliability

Although the contemporary view established a relationship between the reliability and validity of assessment to the extent that reliability evidence is now considered necessary for drawing conclusions about validity, the reverse is still not acknowledged.

Concept of Utility of Assessment

Based on these facts, van der Vleuten proposed the utility of an assessment as an estimated product of reliability, validity, feasibility, acceptability, and educational impact,[17] and van der Vleuten and Schuwirth in 2005 further suggested a modified conceptual model for the utility of assessment, a multiplicative model of different attributes carrying differential weightage: Utility = R × V × E × A × C, where R = reliability, V = validity, E = educational impact, A = acceptability, and C = cost-effectiveness[18] (a small numerical sketch of this model appears at the end of this section). The model also states that perfect assessment is not possible and that a deficiency in one attribute can be compensated if the weightage of another attribute is high, depending on the context and purpose of the assessment. For example, in high-stakes examinations, an assessment with high reliability will be of more value, whereas for frequently repeated in-class tests the educational impact will be the more important criterion. The multiplicative nature of the model also ensures that if any one variable is 0, the overall utility of the assessment automatically becomes 0; similarly, if any variable is negative, for instance, an assessment that promotes unsound learning habits, the utility of the assessment will also be negative.

Valid Reliability: A Theoretical Concept

As stated above, for years we have treated the validity and reliability of assessments as separate measures. With the contemporary view and the utility model, an interrelationship has been established, yet we still regard validity and reliability as independently existing entities. However, consider this: if a measure consistently measures the wrong construct, it can be said to be reliable while its validity is at stake; but in practice, if a measure does not measure what it is intended to measure, can we rely on its measurements? Not at all! Although reliability is apparently present, because of the validity problem we cannot rely on the assessment results to make valid and reliable inferences. Similarly, if an assessment measures the right construct but not in a consistent manner, it is said to be valid but not reliable; yet such occasional, accurate measurement of the construct is of no use for making valid and reliable inferences. Validity and reliability go hand in hand, as demonstrated by a quantifiable link: it has been documented that the maximum attainable validity is approximately the square root of the reliability. For example, if the reliability coefficient of a test is 0.79, the validity coefficient cannot be larger than about 0.89, the square root of 0.79. Hence, an assessment method is of use only if it is valid as well as reliable.[19] As stated above, validity is now considered a unitary concept, reliability evidence is an important part of it, and as such validity and reliability are interrelated.
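
The two quantitative relationships invoked in this section, the multiplicative utility model and the reliability ceiling on validity, can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only: the differential weightage of attributes that the model allows is not shown, and all attribute values and function names are hypothetical.

```python
import math

def assessment_utility(reliability: float, validity: float,
                       educational_impact: float, acceptability: float,
                       cost_effectiveness: float) -> float:
    """Multiplicative utility model: Utility = R x V x E x A x C."""
    return (reliability * validity * educational_impact
            * acceptability * cost_effectiveness)

def max_attainable_validity(reliability: float) -> float:
    """Ceiling on the validity coefficient implied by the reliability coefficient."""
    return math.sqrt(reliability)

# Hypothetical attribute values on a 0-1 scale.
print(assessment_utility(0.8, 0.7, 0.9, 0.8, 0.6))   # a usable assessment (~0.24)
print(assessment_utility(0.8, 0.0, 0.9, 0.8, 0.6))   # any attribute at 0 -> utility 0
print(assessment_utility(0.8, 0.7, -0.5, 0.8, 0.6))  # a negative attribute -> negative utility
print(round(max_attainable_validity(0.79), 2))       # 0.89, as in the worked example above
```

The zero and negative cases reproduce the behaviour described for the utility model, and the last line reproduces the square-root ceiling used in the 0.79/0.89 example.
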
It has also been documented in the literature that validity and reliability involve a trade-off: the stronger the basis for reliability, the weaker the basis for validity, and vice versa.[16] Even so, the two are still considered separate concepts: reliability evidence is held necessary for the validity of assessment, but what about the contribution of validity to the reliability of assessment? This becomes all the more important when we regard the validity of students’ assessment as a concept relating to the inferences we draw from assessment data, and its reliability as a concept relating to the trust we place in those inferences; the two seem interrelated [Figure 3].

Figure 3: Conceptual representation of validity and reliability for students’ assessment

Let us discuss this further with a simple example. If, during a pre-examination meeting, all the examiners decide that they will award no student more than 80% or less than 50% marks in the practical examination, and the assessment is then carried out with entirely valid methods and yields consistent results, can we still rely on such assessment results? As in the example quoted above, if, after asking a student to write out the steps of the knee-jerk reflex in a theory test, we certify him as able to elicit the knee-jerk reflex, are we making the right inference? Not at all! Can we place confidence in such an inference or assessment? Not at all! To broaden the notion: for students’ assessment, any assessment that is not valid cannot be reliable, and any assessment that is not reliable cannot be valid. This also implies that the validity and reliability of students’ assessment should be considered a unified phenomenon, a unified concept, instead of discrete units. To make an “accurate inference from any assessment with full confidence,” our assessments should be both reliable and valid. Moreover, it is high time we accepted students’ assessment as having valid reliability, or reliable validity!