Reference standard for the evaluation of automatic segmentation algorithms: Quantification of inter observer variability of manual delineation of prostate contour on MRI
IF 4.9 | Zone 2 (Medicine) | Q1 | RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Sébastien Molière, Dimitri Hamzaoui, Benjamin Granger, Sarah Montagne, Alexandre Allera, Malek Ezziane, Anna Luzurier, Raphaelle Quint, Mehdi Kalai, Nicholas Ayache, Hervé Delingette, Raphaële Renard-Penna
Journal: Diagnostic and Interventional Imaging. Publication date: 2024-02-01. DOI: 10.1016/j.diii.2023.08.001. Article page: https://www.sciencedirect.com/science/article/pii/S2211568423001547
Citations: 0
Abstract
Purpose
The purpose of this study was to investigate inter-reader variability in manual prostate contour segmentation on magnetic resonance imaging (MRI) examinations and to determine the optimal number of readers required to establish a reliable reference standard.
Materials and methods
Seven radiologists with varying levels of experience independently performed manual segmentation of the prostate contour (whole-gland [WG] and transition zone [TZ]) on 40 prostate MRI examinations obtained in 40 patients. Inter-reader variability in prostate contour delineations was estimated using standard metrics (Dice similarity coefficient [DSC], Hausdorff distance and volume-based metrics). The impact of the number of readers (from two to seven) on segmentation variability was assessed using pairwise metrics (consistency) and metrics with respect to a reference segmentation (conformity), with the reference obtained using either majority voting or the simultaneous truth and performance level estimation (STAPLE) algorithm.
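The consistency and conformity measures described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy one-dimensional masks, the strict-majority tie rule and the convention for two empty masks are all assumptions made for illustration, and only majority voting is shown (STAPLE and the Hausdorff distance are omitted).

```python
from itertools import combinations

import numpy as np


def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / total


def majority_vote(masks):
    """Voxel-wise consensus: foreground where strictly more than half of the readers agree."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stack.sum(axis=0) * 2 > len(masks)


def consistency(masks):
    """Mean DSC over all pairs of readers (pairwise metric)."""
    return float(np.mean([dice(a, b) for a, b in combinations(masks, 2)]))


def conformity(masks):
    """Mean DSC of each reader against the majority-vote reference."""
    reference = majority_vote(masks)
    return float(np.mean([dice(m, reference) for m in masks]))


# Toy masks for three hypothetical readers (illustrative values, not study data).
readers = [
    np.array([1, 1, 1, 0, 0, 0]),
    np.array([1, 1, 0, 0, 0, 0]),
    np.array([1, 1, 1, 1, 0, 0]),
]

print("consistency:", consistency(readers))
print("conformity:", conformity(readers))
```

With an even number of readers, the strict-majority rule used here sends tied voxels to background; other tie-breaking rules are possible and would change the reference.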
Results
The average pairwise DSC between two readers was 0.919 for WG and 0.876 for TZ. Variability decreased as the number of readers increased: the interquartile ranges of the DSC were 0.076 (WG) / 0.021 (TZ) for configurations with two readers, 0.005 (WG) / 0.012 (TZ) for configurations with three readers, and 0.002 (WG) / 0.0037 (TZ) for configurations with six readers. The interquartile range decreased slightly faster between two and three readers than between three and six readers. When using consensus methods, variability often reached its minimum with three readers: with STAPLE, DSC = 0.96 [range: 0.945–0.971] for WG and DSC = 0.94 [range: 0.912–0.957] for TZ, and the interquartile range was minimal for configurations with three readers.
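One way to picture an interquartile range "for configurations with k readers" is to enumerate every subset of k readers, score each subset by its mean pairwise DSC, and take the spread of those scores. The sketch below does this under that assumption; the abstract does not spell out the exact procedure, and the 4-reader DSC matrix is made up for illustration rather than taken from the study.

```python
from itertools import combinations

import numpy as np


def iqr_by_group_size(pairwise, k):
    """IQR of the mean pairwise DSC across all subsets of k readers."""
    n = pairwise.shape[0]
    scores = []
    for subset in combinations(range(n), k):
        pairs = [pairwise[i, j] for i, j in combinations(subset, 2)]
        scores.append(np.mean(pairs))
    q1, q3 = np.percentile(scores, [25, 75])
    return float(q3 - q1)


# Made-up symmetric DSC matrix for four hypothetical readers (not study data).
pairwise = np.array([
    [1.00, 0.90, 0.92, 0.88],
    [0.90, 1.00, 0.94, 0.91],
    [0.92, 0.94, 1.00, 0.89],
    [0.88, 0.91, 0.89, 1.00],
])

for k in (2, 3, 4):
    print(k, "readers -> IQR", iqr_by_group_size(pairwise, k))
```

In this toy setup the IQR necessarily collapses to zero when k equals the total number of readers, because only one configuration exists; this is why the spread is only informative for k below the full panel size.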
Conclusion
The number of readers affects inter-reader variability, in terms of both inter-reader consistency and conformity to a reference. With both pairwise metrics and metrics computed with respect to a reference, variability is either minimal for three readers or three readers represent a tipping point in the evolution of variability. Accordingly, three readers may represent an optimal number to determine references for artificial intelligence applications.
About the journal:
Diagnostic and Interventional Imaging accepts publications originating from any part of the world based only on their scientific merit. The Journal focuses on illustrated articles with strong iconographic content and aims to help sharpen clinical decision-making skills as well as to follow high-level research topics. All articles are published in English.
Diagnostic and Interventional Imaging publishes editorials, technical notes, letters, original and review articles on abdominal, breast, cancer, cardiac, emergency, forensic medicine, head and neck, musculoskeletal, gastrointestinal, genitourinary, interventional, obstetric, pediatric, thoracic and vascular imaging, neuroradiology, nuclear medicine, as well as contrast material, computer developments, health policies and practice, and medical physics relevant to imaging.