Interobserver variability of recall decisions between mammography readers in the English NHS breast screening programme: A comparison of interobserver variability measures
Impact factor 3.3; Zone 3 (Medicine); JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging)
Laura Quinn, David Jenkinson, Sian Taylor-Phillips, Yemisi Takwoingi, Alice Sitch
European Journal of Radiology, volume 197, Article 112723. Published 2026-04-01 (Epub 2026-02-07). DOI: 10.1016/j.ejrad.2026.112723. Source: https://www.sciencedirect.com/science/article/pii/S0720048X26000719
Citations: 0
Abstract
Objectives
To evaluate interobserver variability between mammogram readers’ recall decisions in the English NHS breast screening programme, comparing different variability measures.
Methods
Data were included from 401,682 women at 22 NHS centres whose screening mammograms were interpreted independently by two readers. Percentage agreement, prevalence-adjusted bias-adjusted kappa (PABAK), Gwet’s agreement coefficient (Gwet’s AC) and Cohen’s kappa were reported with 95% confidence intervals. Analyses were performed separately for women at first and subsequent screening appointments, and by cancer diagnosis, reader recall rate and age group.
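The abstract does not state formulas for the four measures. As a reference sketch, the standard two-reader, binary-outcome definitions are assumed below, with Gwet’s AC taken to be the AC1 form. The notation is introduced here for illustration only: a is the number of women both readers recalled, b and c are the numbers recalled by only reader 1 or only reader 2, d is the number neither recalled, and n = a + b + c + d.

```latex
\begin{align*}
P_o &= \frac{a + d}{n}, \qquad
p_1 = \frac{a + b}{n}, \qquad
p_2 = \frac{a + c}{n} \\[4pt]
\text{PABAK} &= 2P_o - 1 \\[4pt]
\kappa_{\text{Cohen}} &= \frac{P_o - P_e^{\kappa}}{1 - P_e^{\kappa}},
  \qquad P_e^{\kappa} = p_1 p_2 + (1 - p_1)(1 - p_2) \\[4pt]
\text{AC1} &= \frac{P_o - P_e^{\text{AC1}}}{1 - P_e^{\text{AC1}}},
  \qquad P_e^{\text{AC1}} = 2\pi(1 - \pi), \quad \pi = \frac{p_1 + p_2}{2}
\end{align*}
```

Under these definitions the measures differ only in the chance-agreement term: when recall is rare, Cohen’s chance term approaches 1 while the AC1 chance term approaches 0, and PABAK fixes it at 0.5.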
Results
Of 86,287 women at first screening, 6,491 (7.5%) were recalled, compared with 9,488 (3.0%) of 315,395 at subsequent screenings. Three measures were lower at first screening than at subsequent screenings: percentage agreement (93.6%, 95% CI: 93.4–93.7 vs 97.2%, 95% CI: 97.2–97.3), Gwet’s AC (92.3, 95% CI: 92.1–92.5 vs 97.0, 95% CI: 97.0–97.1) and PABAK (87.2, 95% CI: 86.9–87.4 vs 94.4, 95% CI: 94.3–94.5). Cohen’s kappa, which is biased downwards when the prevalence of recall is lower, did not change (61.6, 95% CI: 60.7–62.5 vs 61.8, 95% CI: 61.0–62.5). Percentage agreement, Gwet’s AC, and PABAK were lower for women with cancer detected than for those without, whereas Cohen’s kappa showed the opposite pattern, driven by prevalence bias. The same three measures were lower when one or both readers had high recall rates, whereas Cohen’s kappa showed no important pattern.
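As a quick consistency check on the reported figures: for a binary decision, PABAK is conventionally defined as twice the observed agreement minus one. The sketch below assumes that definition (the abstract does not state it) and applies it, on the percentage scale, to the point estimates quoted above. The function name is hypothetical.

```python
def pabak_from_percent_agreement(po_percent: float) -> float:
    """PABAK on a 0-100 scale, assuming the binary-case identity PABAK = 2*Po - 1."""
    return 2.0 * po_percent - 100.0

# (percentage agreement, PABAK) point estimates as quoted in the Results.
reported = {
    "first screening": (93.6, 87.2),
    "subsequent screening": (97.2, 94.4),
}

for label, (po, pabak_reported) in reported.items():
    # Round to one decimal place to match the precision of the reported values.
    derived = round(pabak_from_percent_agreement(po), 1)
    print(f"{label}: derived PABAK = {derived}, reported PABAK = {pabak_reported}")
```

Both derived values match the reported point estimates to one decimal place, which illustrates that PABAK is a linear rescaling of percentage agreement and so carries no information about prevalence.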
Conclusions
Percentage agreement, Gwet’s AC, and PABAK showed lower agreement in three settings: at the more challenging first screen, where no previous mammograms are available for comparison; when women had cancer; and when one or both readers had a high recall rate. Cohen’s kappa was heavily distorted by outcome prevalence. Despite its widespread use, Cohen’s kappa is inappropriate for low-prevalence settings such as screening, and for comparisons between groups in which prevalence differs.
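The prevalence effect described in the conclusions can be illustrated with a small sketch. The two 2×2 tables below are hypothetical counts chosen for illustration, not study data, and the formulas are the standard two-reader binary definitions, with Gwet’s AC assumed to be AC1. Both tables have the same observed agreement (90%); only the proportion of recalls differs.

```python
def agreement_measures(a: int, b: int, c: int, d: int) -> dict:
    """Agreement measures for two readers making a binary recall decision.

    a: both recall; b: reader 1 only; c: reader 2 only; d: neither recalls.
    """
    n = a + b + c + d
    po = (a + d) / n                 # observed agreement
    p1 = (a + b) / n                 # reader 1 recall rate
    p2 = (a + c) / n                 # reader 2 recall rate
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)   # Cohen's chance agreement
    pi = (p1 + p2) / 2               # mean recall rate across readers
    pe_ac1 = 2 * pi * (1 - pi)       # Gwet's AC1 chance agreement
    return {
        "percent_agreement": po,
        "pabak": 2 * po - 1,
        "gwet_ac1": (po - pe_ac1) / (1 - pe_ac1),
        "cohen_kappa": (po - pe_kappa) / (1 - pe_kappa),
    }

# Hypothetical tables with identical observed agreement (90 of 100 cases).
balanced = agreement_measures(a=40, b=5, c=5, d=50)   # recall is common
rare = agreement_measures(a=2, b=5, c=5, d=88)        # recall is rare

for name in balanced:
    print(f"{name}: balanced = {balanced[name]:.3f}, rare = {rare[name]:.3f}")
```

In this toy example, Cohen’s kappa falls from roughly 0.80 to roughly 0.23 as recall becomes rare, even though the readers disagree on the same number of cases, while percentage agreement and PABAK are unchanged and AC1 moves only slightly.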
About the journal:
European Journal of Radiology is an international journal that aims to communicate state-of-the-art information on imaging developments to its readers, in the form of high-quality original research articles and timely reviews of current developments in the field.
Its audience includes clinicians at all levels of training, including radiology trainees, newly qualified imaging specialists and experienced radiologists. Its aim is to inform efficient, appropriate and evidence-based imaging practice to the benefit of patients worldwide.