Authors: Stefan K Schauber, Matt Homer
Journal: Medical Education
DOI: 10.1111/medu.15742
Publication date: 2025-06-04
Challenging the norm: Length of exams determined by classification accuracy or reliability.
Purpose: This paper challenges the notion that reliability indices are appropriate for informing test length in medical education exams, where the focus is on ensuring defensible pass-fail decisions. Instead, we argue that classification accuracy is better suited to the purpose of exams in these cases. We show empirically, using resampled test data from a range of undergraduate knowledge exams, that this is indeed the case. More specifically, we address the hypothesis that using classification accuracy results in recommending shorter test lengths than using reliability does.
Method: We analysed exam data from both the pre-clinical and clinical phases of undergraduate medical education. We used a re-sampling procedure in which both the cut-score and the test length of repeatedly generated synthetic exams were varied systematically. N = 52 500 datasets were generated from the original exams. For each of these, both reliability and classification accuracy indices were estimated.
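The re-sampling procedure described above can be illustrated with a minimal sketch: sample items with replacement to build synthetic exams of a given length, then estimate reliability and classification accuracy for each. The toy data, the function names, the use of Cronbach's alpha as the reliability index, and the use of agreement with the full-test pass/fail decision as a stand-in for classification accuracy are all assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def cronbach_alpha(items):
    """Cronbach's alpha for an (examinees x items) 0/1 score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def resample_exam(responses, n_items, cut_score, n_reps=100):
    """Draw synthetic exams of a given length by sampling items with
    replacement; return mean reliability and mean classification accuracy.

    Accuracy is approximated here as agreement between the pass/fail
    decision on the synthetic exam and the decision on the full exam,
    a simplification of the indices used in the paper."""
    n_total = responses.shape[1]
    full_pass = responses.mean(axis=1) >= cut_score
    alphas, accuracies = [], []
    for _ in range(n_reps):
        cols = rng.integers(0, n_total, size=n_items)  # items, with replacement
        synth = responses[:, cols]
        alphas.append(cronbach_alpha(synth))
        synth_pass = synth.mean(axis=1) >= cut_score
        accuracies.append((synth_pass == full_pass).mean())
    return float(np.mean(alphas)), float(np.mean(accuracies))

# Toy data: 500 examinees, 200 dichotomous items, ability-driven responses.
ability = rng.normal(0.65, 0.12, size=500)
difficulty = rng.uniform(-0.15, 0.15, size=200)
p = np.clip(ability[:, None] - difficulty[None, :], 0.05, 0.95)
responses = (rng.random((500, 200)) < p).astype(int)

# Estimate both indices for a 50-item synthetic exam at a 50% cut-score.
alpha, acc = resample_exam(responses, n_items=50, cut_score=0.5)
```

Sweeping `n_items` and `cut_score` over grids, as the study does, then shows how the two indices respond differently to each design choice.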
Results: Results indicate that only classification accuracy, not reliability, varies in relation to the cut-score for pass-fail decisions. Furthermore, reliability and classification accuracy are differently related to test length. The optimal test length when using reliability was around 100 items, independent of pass-rates. For classification accuracy, recommendations are less generic. For exams with a small percentage of fail decisions (i.e., 5% or less), a test length of 50 items achieved, on average, 95% correct classifications.
Conclusions: We suggest a move towards employing classification accuracy using existing tools, whilst still using reliability as a complement. The benefits of re-thinking current test design practice include minimizing the burden of assessment on candidates and test developers. Item writers could focus on developing fewer, but higher quality, items. Finally, we stress the need to consider the effects of the balance between false positive and false negative decisions in pass/fail classifications.
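The balance between false positive and false negative decisions mentioned above can be made concrete with a small sketch. A false positive is a non-master who passes; a false negative is a master who fails. The function below is an illustrative assumption (the paper does not specify this computation); it takes hypothetical true mastery status and observed pass/fail decisions and returns the two error rates.

```python
import numpy as np

def classification_error_rates(true_mastery, observed_pass):
    """False positive rate: share of non-masters who pass.
    False negative rate: share of masters who fail."""
    true_mastery = np.asarray(true_mastery, dtype=bool)
    observed_pass = np.asarray(observed_pass, dtype=bool)
    # Guard against division by zero when a group is empty.
    fp = (observed_pass & ~true_mastery).sum() / max((~true_mastery).sum(), 1)
    fn = (~observed_pass & true_mastery).sum() / max(true_mastery.sum(), 1)
    return float(fp), float(fn)

# Five examinees: three true masters, two non-masters.
# One non-master passes (false positive), one master fails (false negative).
fp, fn = classification_error_rates([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

In practice the two rates carry different costs (e.g., a false positive may let an unqualified candidate progress), so the cut-score can be set to weight them unequally rather than simply maximizing overall accuracy.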
Journal introduction:
Medical Education seeks to be the pre-eminent journal in the field of education for health care professionals, and publishes material of the highest quality, reflecting worldwide or provocative issues and perspectives.
The journal welcomes high quality papers on all aspects of health professional education, including:
-undergraduate education
-postgraduate training
-continuing professional development
-interprofessional education