Confidence intervals for performance estimates in brain MRI segmentation

Impact factor: 10.7 · Zone 1 (Medicine) · JCR Q1 (Computer Science, Artificial Intelligence)

Medical Image Analysis, volume 103, Article 103565 · Publication date: 2025-05-08 · DOI: 10.1016/j.media.2025.103565

Rosana El Jurdi, Gaël Varoquaux, Olivier Colliot


Citation count: 0

Abstract


Confidence intervals for performance estimates in brain MRI segmentation

Medical segmentation models are evaluated empirically. Because such an evaluation is based on a limited set of example images, it is unavoidably noisy. Beyond a mean performance measure, reporting confidence intervals is thus crucial. However, this is rarely done in medical image segmentation. The width of the confidence interval depends on the test set size and on the spread of the performance measure (its standard deviation across the test set). For classification, many test images are needed to avoid wide confidence intervals. Segmentation, however, has not been studied in this respect, and it differs in the amount of information that each test image brings. In this paper, we study the typical confidence intervals in the context of segmentation in 3D brain magnetic resonance imaging (MRI). We carry out experiments using the standard nnU-Net framework, two datasets from the Medical Segmentation Decathlon challenge that concern brain MRI (hippocampus and brain tumor segmentation), and two performance measures: the Dice Similarity Coefficient and the Hausdorff distance. We show that the parametric confidence intervals are reasonable approximations of the bootstrap estimates for varying test set sizes and spreads of the performance metric. Importantly, we show that the test set size needed to achieve a given precision is often much lower than for classification tasks. Typically, a 1%-wide confidence interval requires about 100–200 test samples when the spread is low (standard deviation around 3%). More difficult segmentation tasks may lead to higher spreads and require over 1000 samples. The corresponding code and notebooks are available on GitHub at https://github.com/rosanajurdi/SegVal_Repo.
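The relationship between test set size, spread, and interval width described in the abstract can be illustrated with a short sketch. It is not the authors' code (their notebooks are in the repository linked above). It assumes a 95% confidence level, a normal approximation of the form mean ± z·σ/√n for the parametric interval, a percentile bootstrap of the mean, and that "width" means the full width of the interval. The function names `parametric_ci`, `bootstrap_ci` and `required_test_size` are hypothetical, and the per-image Dice scores are synthetic.

```python
import math

import numpy as np

# Two-sided 95% quantile of the standard normal distribution (assumed level).
Z_95 = 1.959963984540054


def parametric_ci(scores, z=Z_95):
    """Normal-approximation interval for the mean of per-image scores."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # Standard error of the mean: sample standard deviation over sqrt(n).
    sem = scores.std(ddof=1) / math.sqrt(scores.size)
    return float(mean - z * sem), float(mean + z * sem)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-image scores."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample test images with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, scores.size, size=(n_resamples, scores.size))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def required_test_size(std, width, z=Z_95):
    """Smallest n for which the full interval width 2*z*std/sqrt(n) <= width."""
    return math.ceil((2 * z * std / width) ** 2)


if __name__ == "__main__":
    # Synthetic Dice scores, for illustration only (not data from the paper).
    rng = np.random.default_rng(42)
    dice = np.clip(rng.normal(0.90, 0.03, size=150), 0.0, 1.0)
    print("parametric:", parametric_ci(dice))
    print("bootstrap: ", bootstrap_ci(dice))
    print("n for a 1%-wide interval at 3% spread:", required_test_size(0.03, 0.01))
```

Under these assumptions, a standard deviation of 3% and a target width of 1% give n = 139, which falls in the 100–200 range quoted in the abstract, while a standard deviation of 10% pushes the requirement above 1000 images. The normal approximation is only a sketch for bounded or skewed measures: Dice is confined to [0, 1] and the Hausdorff distance can be heavy-tailed, which is one reason to compare against the bootstrap.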


Source journal: Medical Image Analysis (Engineering and Technology; Engineering: Biomedical)

CiteScore: 22.10
Self-citation rate: 6.40%
Articles published: 309
Review time: 6.6 months

About the journal: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.