Challenging the norm: Length of exams determined by classification accuracy or reliability.

IF 4.9 | CAS Zone 1, Education | Q1 EDUCATION, SCIENTIFIC DISCIPLINES
Stefan K Schauber, Matt Homer
{"title":"挑战常规:由分类准确性或可靠性决定的考试长度。","authors":"Stefan K Schauber, Matt Homer","doi":"10.1111/medu.15742","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This paper challenges the notion that reliability indices are appropriate for informing test length in exams in medical education, where the focus is on ensuring defensible pass-fail decisions. Instead, we argue that using classification accuracy instead better suited to the purpose of exams in these cases. We show empirically, using resampled test data from a range of undergraduate knowledge exams, that this is indeed the case. More specifically, we address the hypothesis that the use of classification accuracy results in recommending shorter test lengths as compared to when using reliability.</p><p><strong>Method: </strong>We analysed data from previous exams from both pre-clinical and clinical phases of undergraduate medical education. We used a re-sampling procedure in which both the cut-score and test length of repeatedly generated synthetic exams were varied systematically. N = 52 500 datasets were generated from the original exams. For each of these both reliability and classification accuracy indices were estimated.</p><p><strong>Result: </strong>Results indicate that only classification accuracy, not reliability, varies in relation to the cut-score for pass-fail decisions. Furthermore, reliability and classification accuracy are differently related to test length. The optimal test length for using reliability was around 100 items, independent of pass-rates. For classification accuracy, recommendations are less generic. For exams with a small percentage of failed decisions (i.e., 5% or less), an item size of 50 did, on average, achieve an accuracy of 95% correct classifications.</p><p><strong>Conclusions: </strong>We suggest a move towards the employment of classification accuracy using existing tools, whilst still using reliability as a complement. The benefits of re-thinking current test design practice include minimizing the burden of assessment on candidates and test developers. Item writers could focus on developing fewer, but higher quality, items. Finally, we stress the need to consider the effects of the balance false positive and false negative decisions in pass/fail classifications.</p>","PeriodicalId":18370,"journal":{"name":"Medical Education","volume":" ","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Challenging the norm: Length of exams determined by classification accuracy or reliability.\",\"authors\":\"Stefan K Schauber, Matt Homer\",\"doi\":\"10.1111/medu.15742\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This paper challenges the notion that reliability indices are appropriate for informing test length in exams in medical education, where the focus is on ensuring defensible pass-fail decisions. Instead, we argue that using classification accuracy instead better suited to the purpose of exams in these cases. We show empirically, using resampled test data from a range of undergraduate knowledge exams, that this is indeed the case. 
More specifically, we address the hypothesis that the use of classification accuracy results in recommending shorter test lengths as compared to when using reliability.</p><p><strong>Method: </strong>We analysed data from previous exams from both pre-clinical and clinical phases of undergraduate medical education. We used a re-sampling procedure in which both the cut-score and test length of repeatedly generated synthetic exams were varied systematically. N = 52 500 datasets were generated from the original exams. For each of these both reliability and classification accuracy indices were estimated.</p><p><strong>Result: </strong>Results indicate that only classification accuracy, not reliability, varies in relation to the cut-score for pass-fail decisions. Furthermore, reliability and classification accuracy are differently related to test length. The optimal test length for using reliability was around 100 items, independent of pass-rates. For classification accuracy, recommendations are less generic. For exams with a small percentage of failed decisions (i.e., 5% or less), an item size of 50 did, on average, achieve an accuracy of 95% correct classifications.</p><p><strong>Conclusions: </strong>We suggest a move towards the employment of classification accuracy using existing tools, whilst still using reliability as a complement. The benefits of re-thinking current test design practice include minimizing the burden of assessment on candidates and test developers. Item writers could focus on developing fewer, but higher quality, items. Finally, we stress the need to consider the effects of the balance false positive and false negative decisions in pass/fail classifications.</p>\",\"PeriodicalId\":18370,\"journal\":{\"name\":\"Medical Education\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-06-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Education\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1111/medu.15742\",\"RegionNum\":1,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION, SCIENTIFIC DISCIPLINES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/medu.15742","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: This paper challenges the notion that reliability indices are appropriate for informing test length in exams in medical education, where the focus is on ensuring defensible pass-fail decisions. Instead, we argue that classification accuracy is better suited to the purpose of exams in these cases. We show empirically, using resampled test data from a range of undergraduate knowledge exams, that this is indeed the case. More specifically, we address the hypothesis that the use of classification accuracy results in recommending shorter test lengths as compared to when using reliability.

Method: We analysed data from previous exams from both pre-clinical and clinical phases of undergraduate medical education. We used a resampling procedure in which both the cut-score and the test length of repeatedly generated synthetic exams were varied systematically. In total, N = 52 500 datasets were generated from the original exams. For each of these, both reliability and classification accuracy indices were estimated.
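To make the resampling idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a dichotomous persons-by-items response matrix, simulates a hypothetical item bank from a Rasch-like model, uses the pass/fail decision on the full bank as a proxy for the "true" classification, and estimates reliability via Cronbach's alpha and classification accuracy as agreement with the full-bank decision. All names, sample sizes and model choices here are illustrative assumptions.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items matrix of 0/1 item scores."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def resample_exam(bank, test_length, cut_score, n_reps=100, seed=None):
    """Draw synthetic exams of a given length from an item bank and compare
    their pass/fail decisions with those made on the full bank."""
    rng = np.random.default_rng(seed)
    n_items = bank.shape[1]
    true_pass = bank.mean(axis=1) >= cut_score   # proxy for the 'true' status
    alphas, accuracies = [], []
    for _ in range(n_reps):
        items = rng.choice(n_items, size=test_length, replace=False)
        form = bank[:, items]
        alphas.append(cronbach_alpha(form))
        observed_pass = form.mean(axis=1) >= cut_score
        accuracies.append(np.mean(observed_pass == true_pass))
    return np.mean(alphas), np.mean(accuracies)

# Hypothetical data: 1000 candidates, a 200-item bank, fairly easy items.
rng = np.random.default_rng(0)
ability = rng.normal(0.0, 1.0, size=1000)
difficulty = rng.normal(-1.0, 1.0, size=200)
p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
bank = rng.binomial(1, p_correct)

# Vary test length at a fixed cut-score of 60% correct.
for length in (50, 100, 150):
    alpha, acc = resample_exam(bank, test_length=length, cut_score=0.60, seed=1)
    print(f"length={length:3d}  alpha={alpha:.2f}  accuracy={acc:.2f}")
```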

Results: Results indicate that only classification accuracy, not reliability, varies in relation to the cut-score for pass-fail decisions. Furthermore, reliability and classification accuracy are differently related to test length. The optimal test length when using reliability was around 100 items, independent of pass-rates. For classification accuracy, recommendations are less generic. For exams with a small percentage of fail decisions (i.e., 5% or less), a test length of 50 items achieved, on average, an accuracy of 95% correct classifications.
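Continuing the hypothetical sketch above, varying the cut-score at a fixed test length illustrates the pattern reported here: Cronbach's alpha is unchanged, because the cut-score never enters its computation, whereas classification accuracy shifts as the cut-score moves towards or away from the bulk of the score distribution.

```python
# Vary the cut-score at a fixed test length of 100 items (same illustrative bank).
for cut in (0.50, 0.60, 0.70):
    alpha, acc = resample_exam(bank, test_length=100, cut_score=cut, seed=2)
    print(f"cut={cut:.2f}  alpha={alpha:.2f}  accuracy={acc:.2f}")
```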

Conclusions: We suggest a move towards the employment of classification accuracy using existing tools, whilst still using reliability as a complement. The benefits of re-thinking current test design practice include minimizing the burden of assessment on candidates and test developers. Item writers could focus on developing fewer, but higher quality, items. Finally, we stress the need to consider the effects of the balance between false positive and false negative decisions in pass/fail classifications.

Source journal
Medical Education (Medicine - Health Care)
CiteScore: 8.40
Self-citation rate: 10.00%
Publication volume: 279
Review time: 4-8 weeks
About the journal: Medical Education seeks to be the pre-eminent journal in the field of education for health care professionals, and publishes material of the highest quality, reflecting worldwide or provocative issues and perspectives. The journal welcomes high quality papers on all aspects of health professional education, including undergraduate education, postgraduate training, continuing professional development and interprofessional education.