Reference standard for the evaluation of automatic segmentation algorithms: Quantification of inter observer variability of manual delineation of prostate contour on MRI

IF 8.1 · Zone 2 (Medicine) · Q1 Radiology, Nuclear Medicine & Medical Imaging

Diagnostic and Interventional Imaging Pub Date : 2024-02-01 DOI:10.1016/j.diii.2023.08.001

Sébastien Molière , Dimitri Hamzaoui , Benjamin Granger , Sarah Montagne , Alexandre Allera , Malek Ezziane , Anna Luzurier , Raphaelle Quint , Mehdi Kalai , Nicholas Ayache , Hervé Delingette , Raphaële Renard-Penna

Diagnostic and Interventional Imaging, Volume 105, Issue 2, Pages 65-73. Journal Article. Source: https://www.sciencedirect.com/science/article/pii/S2211568423001547

Citations: 0

Abstract

Purpose

The purpose of this study was to investigate the relationship between the number of readers and inter-reader variability in manual prostate contour segmentation on magnetic resonance imaging (MRI) examinations, and to determine the optimal number of readers required to establish a reliable reference standard.

Materials and methods

Seven radiologists with varying levels of experience independently performed manual segmentation of the prostate contour (whole gland [WG] and transition zone [TZ]) on 40 prostate MRI examinations obtained in 40 patients. Inter-reader variability in prostate contour delineations was estimated using standard metrics (Dice similarity coefficient [DSC], Hausdorff distance, and volume-based metrics). The impact of the number of readers (from two to seven) on segmentation variability was assessed using pairwise metrics (consistency) and metrics computed with respect to a reference segmentation (conformity), the reference being obtained either by majority voting or with the simultaneous truth and performance level estimation (STAPLE) algorithm.
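As an illustration of the consistency and conformity metrics named above, the following minimal Python sketch computes the DSC between two masks, the mean pairwise DSC among readers, and the mean DSC of each reader against a majority-vote reference. Masks are represented as sets of voxel indices. The function names, the toy masks, the strict-majority rule, and the empty-mask convention are assumptions made for illustration and are not taken from the study; the Hausdorff distance and STAPLE are not shown.

```python
from collections import Counter
from itertools import combinations


def dice(a, b):
    """Dice similarity coefficient between two masks given as sets of voxels."""
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_dice(masks):
    """Consistency: average DSC over all pairs of readers."""
    pairs = list(combinations(masks, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


def majority_vote(masks):
    """Reference mask: voxels selected by strictly more than half of the readers."""
    votes = Counter(voxel for mask in masks for voxel in mask)
    return {voxel for voxel, n in votes.items() if n > len(masks) / 2}


def mean_conformity(masks):
    """Conformity: average DSC of each reader against the majority-vote reference."""
    reference = majority_vote(masks)
    return sum(dice(mask, reference) for mask in masks) / len(masks)


# Toy 1-D "segmentations" from three hypothetical readers (not study data)
r1 = {1, 2, 3, 4}
r2 = {2, 3, 4, 5}
r3 = {2, 3, 4}
readers = [r1, r2, r3]

print("DSC(r1, r2):", dice(r1, r2))
print("majority-vote reference:", sorted(majority_vote(readers)))
print("mean pairwise DSC:", mean_pairwise_dice(readers))
print("mean conformity DSC:", mean_conformity(readers))
```

With real data, each set would hold the voxel coordinates of one reader's WG or TZ mask for a single examination, and the metrics would then be aggregated over examinations.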

Results

The average segmentation DSC for two readers in pairwise comparison was 0.919 for WG and 0.876 for TZ. Variability decreased as the number of readers increased: the interquartile ranges of the DSC were 0.076 (WG) / 0.021 (TZ) for configurations with two readers, 0.005 (WG) / 0.012 (TZ) for configurations with three readers, and 0.002 (WG) / 0.0037 (TZ) for configurations with six readers. The interquartile range decreased slightly faster between two and three readers than between three and six readers. When using consensus methods, variability often reached its minimum with three readers: with STAPLE, DSC = 0.96 [range: 0.945–0.971] for WG and DSC = 0.94 [range: 0.912–0.957] for TZ, and the interquartile range was minimal for configurations with three readers.
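The abstract does not spell out how reader "configurations" were formed. One plausible reading, assumed here, is that a configuration is any subset of k readers drawn from the full pool, and that the interquartile range is taken over the per-configuration mean pairwise DSC. The sketch below follows that assumption on toy masks; the function names, the quantile method, and the data are hypothetical and do not reproduce the study's numbers.

```python
from itertools import combinations
from statistics import quantiles


def dice(a, b):
    """Dice similarity coefficient between two masks given as sets of voxels."""
    if not a and not b:
        return 1.0  # assumed convention for two empty masks
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_dice(masks):
    """Average DSC over all pairs of readers in one configuration."""
    pairs = list(combinations(masks, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


def iqr(values):
    """Interquartile range; defined as 0.0 when fewer than two values exist."""
    if len(values) < 2:
        return 0.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_by_reader_count(masks):
    """For each k, the spread of mean pairwise DSC across all k-reader subsets."""
    result = {}
    for k in range(2, len(masks) + 1):
        scores = [mean_pairwise_dice(cfg) for cfg in combinations(masks, k)]
        result[k] = iqr(scores)
    return result


# Four hypothetical readers on a toy 1-D example (not study data)
readers = [
    {1, 2, 3, 4},
    {2, 3, 4, 5},
    {2, 3, 4},
    {1, 2, 3, 4, 5},
]

for k, spread in iqr_by_reader_count(readers).items():
    print(f"{k} readers: IQR of mean pairwise DSC = {spread:.4f}")
```

With seven readers, the same loop would enumerate 21 two-reader, 35 three-reader, and 7 six-reader configurations. The single seven-reader configuration has no spread, which is why the sketch returns 0.0 in that case.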

Conclusion

The number of readers affects inter-reader variability, in terms of both inter-reader consistency and conformity to a reference. With both pairwise metrics and metrics computed with respect to a reference, variability is either minimal for three readers or three readers represent a tipping point in the evolution of variability. Accordingly, three readers may represent an optimal number for determining reference standards for artificial intelligence applications.


Source journal

Diagnostic and Interventional Imaging (Medicine: Radiology, Nuclear Medicine and Imaging)

CiteScore: 8.50

Self-citation rate: 29.10%

Number of publications: 126

Review time: 11 days

Journal description: Diagnostic and Interventional Imaging accepts publications originating from any part of the world based only on their scientific merit. The journal focuses on richly illustrated articles and aims to help sharpen clinical decision-making skills while following leading research topics. All articles are published in English. Diagnostic and Interventional Imaging publishes editorials, technical notes, letters, and original and review articles on abdominal, breast, cancer, cardiac, emergency, forensic medicine, head and neck, musculoskeletal, gastrointestinal, genitourinary, interventional, obstetric, pediatric, thoracic and vascular imaging, neuroradiology, and nuclear medicine, as well as on contrast material, computer developments, health policies and practice, and medical physics relevant to imaging.