Towards clinical implementation of automated segmentation of vestibular schwannomas: a reliability study comparing AI and human performance.

IF 2.4 · Zone 3, Medicine · Q2 Clinical Neurology

Neuroradiology · Pub Date: 2025-04-01 · Epub Date: 2025-04-04 · DOI: 10.1007/s00234-025-03611-3

Stefan Cornelissen, Sammy M Schouten, Patrick P J H Langenhuizen, Henricus P M Kunst, Jeroen B Verheul, Peter H N De With

Neuroradiology, pp. 1049-1059 · Journal Article · Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12040986/pdf/

Citations: 0

Abstract

Purpose: To evaluate the clinimetric reliability of automated vestibular schwannoma (VS) segmentations by comparing them with human inter-observer variability on T1-weighted contrast-enhanced MRI scans.

Methods: This retrospective study employed MR images, including follow-up scans, from 1,015 patients (median age: 59 years; 511 men), resulting in 1,856 unique scans. Two nnU-Net models were trained using fivefold cross-validation: a single-center segmentation model and a multi-center model that used additional publicly available data. Geometry-based segmentation metrics (e.g. the Dice score) were used to evaluate model performance. To quantitatively assess the clinimetric reliability of the models, automated tumor volumes from a separate test set were compared to human inter-observer variability using the limits of agreement with the mean (LOAM) procedure. Additionally, new agreement limits that include the automated annotations were calculated.
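As an illustration of the geometry-based metric named in the Methods, the following minimal sketch computes a Dice score and a tumor volume from two binary masks. The function names, the flattened-list mask representation, the voxel volume, and the toy data are assumptions made for this example; they are not taken from the study's code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * overlap / total


def volume_mm3(mask: Sequence[int], voxel_volume_mm3: float) -> float:
    """Tumor volume = number of foreground voxels times the voxel volume."""
    return sum(1 for v in mask if v) * voxel_volume_mm3


# Hypothetical toy masks (1 = tumor voxel), not study data.
automated = [0, 1, 1, 1, 0, 0, 1, 0]
manual = [0, 1, 1, 0, 0, 0, 1, 1]

print(dice_score(automated, manual))  # 0.75
print(volume_mm3(automated, 0.5))     # 2.0
```

Note that two masks can overlap well (high Dice) and still differ in volume by an amount that matters clinically, which is why the study evaluates volumes separately.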

Results: Both models performed comparably to current state-of-the-art VS segmentation models, with median Dice scores of 91.6% and 91.9% for the single-center and multi-center models, respectively. There was a stark difference in clinimetric performance between the two models: automated tumor volumes of the multi-center model fell within human agreement limits in 73% of the cases, compared to 44% for the single-center model. Newly calculated agreement limits that included the single-center model were very high and wide. For the multi-center model, the new agreement limits were comparable to human inter-observer variability.
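The "within human agreement limits" percentages can be pictured with a simplified sketch. It assumes that the limits are taken as ±1.96 times the root-mean-square deviation of each observer's volume from the per-scan observer mean, and that an automated volume counts as "within" when it lies inside those limits around the human mean. The abstract does not give the exact LOAM estimator or its confidence intervals, so this is an illustration of the idea rather than the authors' procedure; all names and numbers are hypothetical.

```python
import math
from typing import Sequence


def agreement_limit(human: Sequence[Sequence[float]]) -> float:
    """Half-width of simplified 95% limits of agreement with the mean.

    human[i][j] is the volume measured by observer j on scan i.
    """
    squared_devs = []
    for row in human:
        mean = sum(row) / len(row)
        squared_devs.extend((v - mean) ** 2 for v in row)
    return 1.96 * math.sqrt(sum(squared_devs) / len(squared_devs))


def fraction_within(auto: Sequence[float],
                    human: Sequence[Sequence[float]],
                    limit: float) -> float:
    """Share of scans whose automated volume lies within mean ± limit."""
    hits = 0
    for a, row in zip(auto, human):
        mean = sum(row) / len(row)
        if abs(a - mean) <= limit:
            hits += 1
    return hits / len(auto)


# Hypothetical volumes in cm^3 (4 scans x 3 observers), not study data.
human_volumes = [[1.0, 1.2, 1.1], [2.0, 2.2, 2.1],
                 [0.5, 0.7, 0.6], [3.0, 3.2, 3.1]]
auto_volumes = [1.15, 2.5, 0.62, 3.0]

limit = agreement_limit(human_volumes)
print(round(limit, 3))                                      # 0.16
print(fraction_within(auto_volumes, human_volumes, limit))  # 0.75
```

A single fixed limit like this treats all tumor sizes alike; the study reports volume-dependent reliability, which this sketch does not model.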

Conclusion: Models with excellent geometric-based metrics do not necessarily imply high clinimetric reliability, demonstrating the need to clinimetrically evaluate models as part of the clinical implementation process. The multi-center model displayed high reliability, warranting its possible future use in clinical practice. However, caution should be exercised when employing the model for small tumors, as the reliability was found to be volume-dependent.


Source journal

Neuroradiology (Medicine: Nuclear Medicine)

CiteScore: 5.30

Self-citation rate: 3.60%

Articles published: 214

Review time: 4-8 weeks

About the journal: Neuroradiology aims to provide state-of-the-art medical and scientific information in the fields of Neuroradiology, Neurosciences, Neurology, Psychiatry, Neurosurgery, and related medical specialities. Neuroradiology, as the official Journal of the European Society of Neuroradiology, receives submissions from all parts of the world and publishes peer-reviewed original research, comprehensive reviews, educational papers, opinion papers, and short reports on exceptional clinical observations and new technical developments in the field of Neuroimaging and Neurointervention. The journal has subsections for Diagnostic and Interventional Neuroradiology, Advanced Neuroimaging, Paediatric Neuroradiology, Head-Neck-ENT Radiology, Spine Neuroradiology, and for submissions from Japan. Neuroradiology aims to provide new knowledge about and insights into the function and pathology of the human nervous system that may help to better diagnose and treat nervous system diseases. Neuroradiology is a member of the Committee on Publication Ethics (COPE) and follows the COPE core practices. Neuroradiology prefers articles that are free of bias, self-critical regarding limitations, transparent and clear in describing study participants, methods, and statistics, and short in presenting results. Before peer review, all submissions are automatically checked by iThenticate to assess potential overlap with prior publications.