Beyond reliability: assessing rater competence when using a behavioural marker system.

Samantha Eve Smith, Scott McColgan-Smith, Fiona Stewart, Julie Mardon, Victoria Ruth Tallentire
Advances in Simulation (London, England) 2024;9(1):55. Published 2024-12-31. DOI: 10.1186/s41077-024-00329-9. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11687013/pdf/. Journal impact factor 2.8, JCR Q2 (Health Care Sciences & Services).

Abstract

Background: Behavioural marker systems are used across several healthcare disciplines to assess behavioural (non-technical) skills, but rater training is variable, and inter-rater reliability is generally poor. Inter-rater reliability provides data about the tool, but not the competence of individual raters. This study aimed to test the inter-rater reliability of a new behavioural marker system (PhaBS: pharmacists' behavioural skills) with clinically experienced faculty raters and near-peer raters. It also aimed to assess rater competence when using PhaBS after brief familiarisation, by assessing completeness, agreement with an expert rater, ability to rank performance, stringency or leniency, and avoidance of the halo effect.

Methods: Clinically experienced faculty raters and near-peer raters attended a 30-min PhaBS familiarisation session. This was immediately followed by a marking session in which they rated a trainee pharmacist's behavioural skills in three scripted immersive acute care simulated scenarios demonstrating good, mediocre, and poor performances respectively. Inter-rater reliability in each group was calculated using the two-way random, absolute-agreement, single-measures intraclass correlation coefficient (ICC). Differences in individual rater competence in each domain were compared between groups using Pearson's chi-squared test.
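The reliability model named above can be reproduced with standard statistical tooling. The following is a minimal sketch, assuming invented example scores in long format (the study's raw ratings are not reproduced here); pingouin's ICC2 row corresponds to the two-way random-effects, absolute-agreement, single-measures model.

import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: three raters each score the same
# three scripted scenarios (all values invented for illustration).
ratings = pd.DataFrame({
    "scenario": ["good", "mediocre", "poor"] * 3,
    "rater": ["r1"] * 3 + ["r2"] * 3 + ["r3"] * 3,
    "score": [27, 16, 6, 24, 18, 9, 26, 14, 5],
})

icc = pg.intraclass_corr(data=ratings, targets="scenario",
                         raters="rater", ratings="score")
# ICC2 = two-way random effects, absolute agreement, single measures.
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])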

Results: The ICC for experienced faculty raters was good at 0.60 (0.48-0.72), while that for near-peer raters was poor at 0.38 (0.27-0.54). Of experienced faculty raters, 5/9 were competent in all domains versus 2/13 near-peer raters (difference not statistically significant). There was no statistically significant difference between clinically experienced and near-peer raters in agreement with an expert rater, ability to rank performance, stringency or leniency, or avoidance of the halo effect. The only statistically significant difference between groups was the ability to complete the assessment (9/9 experienced faculty raters versus 6/13 near-peer raters, p = 0.0077).
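As a check on the completion comparison above, Pearson's chi-squared test on the reported 2 x 2 counts reproduces the quoted p-value, assuming no continuity correction was applied:

from scipy.stats import chi2_contingency

# Completed vs. did not complete the assessment, by rater group.
table = [[9, 0],   # experienced faculty: 9/9 completed
         [6, 7]]   # near-peer raters: 6/13 completed
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")  # chi2 = 7.11, p = 0.0077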

Conclusions: Experienced faculty have acceptable inter-rater reliability when using PhaBS, consistent with other behavioural marker systems; however, not all raters are competent. Competence measures used for other assessments can usefully be applied to behavioural marker systems. When using behavioural marker systems for assessment, educators must start using such rater competence frameworks. This is important to ensure fair and accurate assessments for learners, to provide educators with information about rater training programmes, and to provide individual raters with meaningful feedback.
