{"title":"Validity and Reliability of Students’ Assessment: Case for Recognition as a Unified Concept of Valid Reliability","authors":"Kapil Gupta","doi":"10.4103/ijabmr.ijabmr_382_23","DOIUrl":null,"url":null,"abstract":"Students’ assessment, the major component of learning cycle, and curriculum serve many functions.[1-4] Assessment suggests areas of improvement during the training; the selection of students based on performance helps in the evaluation of the program and also has predictive utility. The assessment can be of learning – summative assessment, for learning - formative assessment, and can be without any external supervision – internal assessment. The importance and necessity of assessment in the learning cycle of the students can be gauged from the wordings of various proverbs in use, such as – ”assessment derives learning,” “assessment leads to learning,” and “assessment are the tail that wags the curriculum dog.” The students do modify their learning as per assessment. To make a precise decision about student’s learning and competency, the assessment must have both measurable and nonmeasurable components.[5] Van der Vleuten and Schuwirth defined assessment “as any formal or purported action to obtain information about the competence and performance of a student.”[6] Further, the assessment can be either criterion-referenced-comparing the competence of students against some fixed criteria or norm-referenced-comparing the performance of students with each other. Besides an aid to learning by virtue of having a provision of feedback and thus improve learning, assessment has reverse side too-improperly designed assessment can disfigure the learning. Therefore, any assessment should possess certain qualities or attributes. Traditional Concept Two important attributes defining students’ assessment are – reliability and validity. Conventionally, reliability of an assessment tool has commonly been referred to as “reproducibility” or “getting the same scores/marks under same conditions” or “precision of the measurement” or “consistency with which a test measures what it is supposed to assess.”[7] Reliability is measurable. As per classical test theory, the alpha coefficient (AC) is a range from 0 (no reliability) to 1 (perfect reliability); so if the test has an AC of 0.8%, it means it has a reliability of 80%, while measurement error is 20%.[8] The major factor affecting reliability is content or domain specificity. How an assessment can be reliable if it is based on a limited sampling of content or large content has been included in a single sample or if it is based on a single test? Moreover, a score that is derived from solving one problem cannot be interpolated for the second one. For example, assessment scores that are based on a single long case or viva for a single patient sample cannot produce reliable scores for another problem. If at the end of any professional year, subject knowledge is assessed by single multiple-choice questions (MCQs) based test of 10 items, can it measure students’ knowledge for the whole subject? Such assessments can be held valid but not reliable. Therefore, for any assessment test to be reliable, it is important to have a representation of the entire content as well as adequate sampling. Further, the reliability can also be increased by increasing the testing time, separating the whole content into multiple tests rather than a single test, and selecting battery of tests to access the same competency. 
Many studies have observed that almost the same reliability scores can be achieved with many assessment tools/methods if we increase the testing time and appropriate sampling is done.[9-15] Validity, another important characteristic of good assessment, is usually defined as measuring what it intends to measure. Validity is a unitary concept, and evidence can be drawn from many aspects such as content and construct-related and empirical evidence; therefore, the validity of assessment cannot be represented by a single coefficient like that of reliability. For example, if the performance of urinary blood proteins is part of 1st-year undergraduate medical training and these are not assessed in skill examination, the content-related validity is at stake. The construct validity consists of many attributes and mainly focuses on the problem-solving abilities of students based on subject knowledge, skills, data gathering and analysis, interpretation, communication, ethics, and many more. Thus, traditional concept was restricted to validity and reliability of the assessment tools and considered both as separate and unrelated entities [Figure 1].Figure 1: Traditional concept of validity and reliabilityContemporary View Of late, we have moved on from the historical concepts of validity and reliability of the assessment tools. Now, for educational purposes, we are not interested in the reliability of an assessment tool. Now, the more important aspect is how we use the tool to make the results reliable. Hence, as per the contemporary view, what we are really interested in is – ”Rely-ability” on our assessment results – How much we can trust our assessment results to make final judgments about students. Similarly, the contemporary concept of validity focuses on the interpretation that we make out of assessment data and not on the validity of assessment tools. So it is often said – No assessment tool/method is inherently invalid, more important is - what inference we draw from the assessment made using that tool. For example, the MCQs used will measure factual knowledge if it is designed to check factual knowledge; however, if some case-based scenario or some management plan for any disease is in-built into such MCQs, it will assess the problem-solving abilities of the students. Similarly, results will not be valid if in theory examination, steps for elicitation of knee-jerk reflex are asked, and student is certified to have skills for performing knee-jerk reflex. Hence, what is measured by the particular tool does not depend upon the tool, but what we put into the tool or what interpretation is drawn from results using that tool. The measurement of reliability of any assessment by any statistical methods should always be deduced, keeping the validity of the assessment in mind. The contemporary concept of validity considers validity as a unitary concept deduced from various empirical evidence including content, criterion, construct, and reliability evidence.[16] The reliability values have no meaning with poor validity. 
Thus, as per the contemporary view, the validity of an assessment is drawn from various evidence, including reliability evidence, and as such validity and reliability are no more considered separate entities, working in isolation [Figure 2].Figure 2: Contemporary vies of validity and reliabilityAlthough the contemporary view established a relationship between the reliability and validity of the assessment to the extent that we started considering reliability evidence necessary for drawing conclusions about validity, still the reverse is not true. Concept of Utility of Assessment Based on these facts, Vleuten deduced a concept of the utility of an assessment as an estimated product of reliability, validity, feasibility, acceptability, and educational impact,[17] while Vleuten and Schuwirth in 2005 further suggested a modified conceptual model to calculate the utility of assessment which was multiplicative model of different attributes by their differential weightage, as Utility = R × V × E × A × C, where R = Reliability, V = Validity, EI = Educational impact, A = Acceptability, C = Cost effectiveness.[18] This model also states that perfect assessment is not possible and deficiency in one attribute can be compensated if the weightage of another attribute is high; depending on the context and purpose of assessment. For example, in high-stake examinations, the assessment with high reliability will be of more value, while for any in-class test of multiple occurrences, the educational impact will be a more considerable criterion. The multiplicative nature of this model also ensures that if one variable is 0, the overall utility of the assessment automatically becomes 0. Similarly, if any one of the variables is in negative terms - thus promoting unsound learning habits - the utility of assessment will also be in negative. Valid reliability: A Theoretical Concept As stated above, for years, we have always treated the validity and reliability of assessments as separate measures. With contemporary views and utility models, an interrelationship has been established; but still, we consider validity and reliability as independent existing entities. However, do consider yourself, if any measure is consistently measuring a wrong construct, it can be said to be reliable while validity is at stake; but in practice, if any measure is not measuring what it is intended to measure, can we rely on its measurements? Not at all! Although apparently, reliability is there because of validity issues, we cannot rely on the assessment results to make valid and reliable inferences. Similarly, if any assessment is measuring the right construct, but not in a consistent manner, it is said to be valid, not reliable, but that occasional, accurate measurement of construct is of no use to make valid and reliable inferences. Validity and reliability go hand-in-hand, as demonstrated by a quantifiable link. It has been documented that the square root of reliability is almost equivalent to maximum attainable validity. For example, if the reliability coefficient for any test is 0.79, the validity coefficient cannot be larger than 0.88, which is itself square root of 0.79. Hence, it clearly implies that an assessment method is of use, only if it is valid as well as reliable.[19] As stated above, validity is now considered a unitary concept, and reliability evidence is an important part of validity; and as such, validity and reliability are interrelated. 
It has also been documented in the literature that validity and reliability experience trade-off – the stronger the bases for reliability, the weaker the bases for validity (vice-versa).[16] However, still, these two are considered separate concepts – reliability evidence is considered necessary for the validity of assessment, but what about validity contribution for the reliability of assessment? This becomes all the way more important when we are considering the validity of students’ assessment as a concept related to our inference from the assessment data and reliability of students’ assessment as a concept related to trust on the assessment inference, and both seems interrelated [Figure 3].Figure 3: Conceptual representation of validity and reliability for students’ assessmentLet us discuss it more with a simple example. If during a preexamination meeting, all the examiners have decided that they will not give more than 80% marks and <50% marks to any student in the practical examination, and the assessment has been carried out with all the valid methods, results are consistent too, can we still rely on such assessment results? As stated in another example quoted above – If after asking a student to write different steps of knee-jerk reflection during an article test, we are certifying him to have an ability to elicit knee-jerk – Are we making the right inference? Not at all! Can we lay confidence on such an inference or assessment – not at all! To broaden the notion – For students’ assessment, any assessment which is not valid cannot be reliable; and any assessment which is not reliable cannot be valid. This also implies that validity and reliability for students’ assessment should be considered a unified phenomenon, a unified concept, instead of discrete units. To make “accurate inference from any assessment with full confidence,” our assessment should be both reliable and valid. Moreover, it is high time; we should accept students’ assessments as having valid reliability or reliable validity!","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4103/ijabmr.ijabmr_382_23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Students' assessment, a major component of the learning cycle and the curriculum, serves many functions.[1-4] Assessment suggests areas of improvement during training; selection of students based on performance helps in the evaluation of the program and also has predictive utility. Assessment can be of learning (summative assessment), for learning (formative assessment), or conducted without external supervision (internal assessment). The importance and necessity of assessment in the learning cycle of students can be gauged from commonly used sayings such as "assessment drives learning," "assessment leads to learning," and "assessment is the tail that wags the curriculum dog." Students do modify their learning according to the assessment. To make a precise decision about a student's learning and competency, the assessment must have both measurable and nonmeasurable components.[5] Van der Vleuten and Schuwirth defined assessment "as any formal or purported action to obtain information about the competence and performance of a student."[6] Further, assessment can be either criterion-referenced, comparing the competence of students against fixed criteria, or norm-referenced, comparing the performance of students with each other. Besides aiding learning by providing feedback, assessment has a reverse side too: improperly designed assessment can disfigure learning. Therefore, any assessment should possess certain qualities or attributes.

Traditional Concept

Two important attributes defining students' assessment are reliability and validity. Conventionally, the reliability of an assessment tool has been referred to as "reproducibility," "getting the same scores/marks under the same conditions," "precision of the measurement," or "consistency with which a test measures what it is supposed to assess."[7] Reliability is measurable. As per classical test theory, the alpha coefficient (AC) ranges from 0 (no reliability) to 1 (perfect reliability); so if a test has an AC of 0.8, it has a reliability of 80%, while the measurement error is 20%.[8] The major factor affecting reliability is content or domain specificity. How can an assessment be reliable if it is based on limited sampling of the content, if too much content is packed into a single sample, or if it rests on a single test? Moreover, a score derived from solving one problem cannot be extrapolated to a second one. For example, assessment scores based on a single long case, or a viva on a single patient, cannot produce reliable scores for another problem. If, at the end of a professional year, subject knowledge is assessed by a single multiple-choice question (MCQ)-based test of 10 items, can it measure students' knowledge of the whole subject? Such assessments can be held valid but not reliable. Therefore, for any assessment to be reliable, it is important to have representation of the entire content as well as adequate sampling. Further, reliability can also be increased by increasing the testing time, separating the whole content into multiple tests rather than a single test, and selecting a battery of tests to assess the same competency.
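To make the numbers above concrete, here is a minimal sketch (not from the article) of how the alpha coefficient of classical test theory can be computed for a short MCQ paper, together with the Spearman-Brown prophecy formula, which quantifies the claim that lengthening a test raises its reliability. The score matrix, helper names, and lengthening factor are all hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Alpha coefficient for a (students x items) matrix of item scores."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(reliability: float, lengthening_factor: float) -> float:
    """Predicted reliability when a test is lengthened by the given factor."""
    return (lengthening_factor * reliability) / (1 + (lengthening_factor - 1) * reliability)

# Hypothetical data: 6 students answering a 5-item MCQ test (1 = correct, 0 = incorrect)
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
])

alpha = cronbach_alpha(scores)
print(f"alpha coefficient     : {alpha:.2f}")                       # reliability of the observed scores
print(f"measurement error     : {1 - alpha:.2f}")                   # the remainder, as interpreted in the text
print(f"alpha if test doubled : {spearman_brown(alpha, 2):.2f}")    # longer test, higher predicted reliability
```

On this toy data the alpha is about 0.73, and doubling the hypothetical test length pushes the predicted value above 0.84, in line with the article's point that adequate sampling and longer testing time improve reliability.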
Many studies have observed that almost the same reliability scores can be achieved with many assessment tools/methods if the testing time is increased and appropriate sampling is done.[9-15]

Validity, another important characteristic of good assessment, is usually defined as measuring what the assessment intends to measure. Validity is a unitary concept, and evidence can be drawn from many aspects, such as content-related, construct-related, and empirical evidence; therefore, the validity of an assessment cannot be represented by a single coefficient, as reliability can. For example, if performing tests for urinary and blood proteins is part of first-year undergraduate medical training and these skills are not assessed in the skills examination, content-related validity is at stake. Construct validity consists of many attributes and mainly focuses on the problem-solving abilities of students based on subject knowledge, skills, data gathering and analysis, interpretation, communication, ethics, and more. Thus, the traditional concept was restricted to the validity and reliability of assessment tools and considered both as separate and unrelated entities [Figure 1].

Figure 1: Traditional concept of validity and reliability

Contemporary View

Of late, we have moved on from the historical concepts of validity and reliability of assessment tools. For educational purposes, we are no longer interested in the reliability of an assessment tool per se; the more important question is how we use the tool so that the results are reliable. Hence, as per the contemporary view, what we are really interested in is "rely-ability" on our assessment results: how much we can trust the assessment results when making final judgments about students. Similarly, the contemporary concept of validity focuses on the interpretation we make of assessment data and not on the validity of assessment tools. So it is often said that no assessment tool/method is inherently invalid; what matters more is the inference we draw from the assessment made using that tool. For example, an MCQ will measure factual knowledge if it is designed to check factual knowledge; however, if a case-based scenario or a management plan for a disease is built into the MCQ, it will assess the problem-solving abilities of the students. Similarly, the results will not be valid if, in a theory examination, the steps for eliciting the knee-jerk reflex are asked and the student is then certified as having the skill to perform the knee-jerk reflex. Hence, what a particular tool measures does not depend upon the tool itself, but on what we put into the tool and what interpretation is drawn from the results obtained with it. The reliability of any assessment, however it is estimated statistically, should always be interpreted keeping the validity of the assessment in mind. The contemporary view treats validity as a unitary concept deduced from various empirical evidence, including content, criterion, construct, and reliability evidence.[16] Reliability values have no meaning when validity is poor.
Thus, as per the contemporary view, the validity of an assessment is drawn from various evidence, including reliability evidence, and as such, validity and reliability are no longer considered separate entities working in isolation [Figure 2].

Figure 2: Contemporary view of validity and reliability

Although the contemporary view established a relationship between the reliability and validity of assessment, to the extent that we started considering reliability evidence necessary for drawing conclusions about validity, the reverse is still not true.

Concept of Utility of Assessment

Based on these facts, van der Vleuten proposed a concept of the utility of an assessment as an estimated product of reliability, validity, feasibility, acceptability, and educational impact,[17] while van der Vleuten and Schuwirth in 2005 further suggested a modified conceptual model to calculate the utility of assessment, a multiplicative model in which the different attributes carry differential weightage: Utility = R × V × EI × A × C, where R = reliability, V = validity, EI = educational impact, A = acceptability, and C = cost-effectiveness.[18] This model also states that perfect assessment is not possible and that a deficiency in one attribute can be compensated for if the weightage of another attribute is high, depending on the context and purpose of the assessment. For example, in high-stakes examinations, an assessment with high reliability will be of more value, while for an in-class test held on multiple occasions, the educational impact will be the more important criterion. The multiplicative nature of this model also ensures that if one variable is 0, the overall utility of the assessment automatically becomes 0. Similarly, if any one of the variables is negative, thus promoting unsound learning habits, the utility of the assessment will also be negative.

Valid Reliability: A Theoretical Concept

As stated above, for years we have treated the validity and reliability of assessments as separate measures. With contemporary views and utility models, an interrelationship has been established, but we still consider validity and reliability as independently existing entities. However, consider this: if a measure is consistently measuring the wrong construct, it can be said to be reliable while its validity is at stake; but in practice, if a measure is not measuring what it is intended to measure, can we rely on its measurements? Not at all! Although reliability is apparently present, because of the validity issues we cannot rely on the assessment results to make valid and reliable inferences. Similarly, if an assessment is measuring the right construct, but not in a consistent manner, it is said to be valid but not reliable; yet that occasional accurate measurement of the construct is of no use for making valid and reliable inferences. Validity and reliability go hand-in-hand, as demonstrated by a quantifiable link: it has been documented that the square root of reliability is approximately the maximum attainable validity. For example, if the reliability coefficient of a test is 0.79, the validity coefficient cannot be larger than about 0.89, the square root of 0.79. Hence, it clearly implies that an assessment method is of use only if it is valid as well as reliable.[19] As stated above, validity is now considered a unitary concept, and reliability evidence is an important part of validity; as such, validity and reliability are interrelated.
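The two quantitative points above can be illustrated with a short, purely hypothetical sketch: a multiplicative utility in which any zero attribute drives the whole product to zero, and the reliability-based ceiling on validity. The helper names, attribute values, and the use of weights as exponents are illustrative assumptions, not part of the cited model.

```python
import math

def assessment_utility(r, v, ei, a, c, weights=(1, 1, 1, 1, 1)):
    """Multiplicative utility: Utility = R x V x EI x A x C.

    The optional weights (used here as exponents) are just one possible way to
    encode the 'differential weightage' mentioned in the text; they are not
    prescribed by the model itself.
    """
    utility = 1.0
    for value, weight in zip((r, v, ei, a, c), weights):
        utility *= value ** weight
    return utility

def max_attainable_validity(reliability: float) -> float:
    """Approximate ceiling on the validity coefficient: sqrt(reliability)."""
    return math.sqrt(reliability)

# Hypothetical attribute values for a practical examination
print(assessment_utility(r=0.8, v=0.7, ei=0.6, a=0.9, c=0.8))  # modest but non-zero utility
print(assessment_utility(r=0.8, v=0.0, ei=0.6, a=0.9, c=0.8))  # any zero attribute -> utility 0

# The worked example from the text: reliability 0.79 caps validity near 0.89
print(f"{max_attainable_validity(0.79):.2f}")
```

Running the sketch, the second call prints 0.0 because validity is zero, and the last line prints 0.89, matching the worked example in the text (√0.79 ≈ 0.889).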
It has also been documented in the literature that validity and reliability involve a trade-off: the stronger the basis for reliability, the weaker the basis for validity, and vice versa.[16] However, the two are still considered separate concepts: reliability evidence is considered necessary for the validity of an assessment, but what about the contribution of validity to the reliability of an assessment? This becomes all the more important when we consider the validity of students' assessment as a concept related to the inference we draw from assessment data, and the reliability of students' assessment as a concept related to the trust we place in that inference; the two seem interrelated [Figure 3].

Figure 3: Conceptual representation of validity and reliability for students' assessment

Let us discuss this further with a simple example. If, during a pre-examination meeting, all the examiners decide that in the practical examination they will give no student more than 80% marks and none less than 50% marks, and the assessment is then carried out using entirely valid methods and the results are consistent too, can we still rely on such assessment results? As in the example quoted above, if after asking a student to write the steps of the knee-jerk reflex in a theory test we certify him as having the ability to elicit the knee-jerk reflex, are we making the right inference? Not at all! Can we place confidence in such an inference or assessment? Not at all! To broaden the notion: for students' assessment, any assessment that is not valid cannot be reliable, and any assessment that is not reliable cannot be valid. This also implies that validity and reliability for students' assessment should be considered a unified phenomenon, a unified concept, instead of discrete units. To make an "accurate inference from any assessment with full confidence," our assessment should be both reliable and valid. Moreover, it is high time we accepted students' assessments as having valid reliability, or reliable validity!