Sevgi Emin, Elia Rossi, Mattias Hedman, Marcela Giovenco, Fernanda Villegas, Eva Onjukka
Performance of multi-vendor auto-segmentation models for thoracic organs at risk trained on a single dataset
Physica Medica-European Journal of Medical Physics, volume 137, Article 105089. Published 2025-08-23. DOI: 10.1016/j.ejmp.2025.105089
https://www.sciencedirect.com/science/article/pii/S1120179725001991
Abstract
Introduction
This study evaluates the delineation quality of artificial intelligence (AI)-based auto-segmentation models trained on the same dataset, because the intrinsic performance of commercial solutions cannot be evaluated when their training datasets differ. A diverse set of challenging thoracic organs at risk (OAR) was chosen to reveal potential limitations of AI-based tools that are relevant to their clinical adoption.
Materials & Methods
A structure set with 16 OAR was delineated and reviewed by radiation oncology experts for 250 patients with lung tumours (200/50 for training/testing). Three participating vendors had time-limited access to the training dataset, and each developed a model in a way that mimicked its commercial model development strategy.
The authors tested the models on the blinded test dataset. The quantitative analysis used the Dice Similarity Coefficient (DSC), surface DSC (sDSC), the 95th percentile of the Hausdorff Distance (HD95) and the average symmetric surface distance (ASSD). Inter-observer variability in manual segmentation was estimated from three independent expert delineations for a subset of five test patients.
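The four metrics named above can be illustrated with a minimal, dependency-free sketch. This is not the authors' evaluation code: the representation of a segmentation as a set of voxel indices, the function names, the voxel spacing, the sDSC tolerance (`tolerance_mm`), the point-based approximation of sDSC, and the convention of taking the larger of the two directed 95th percentiles for HD95 are all assumptions made for illustration. The abstract does not state which conventions or tolerance were used.

```python
import math

# 6-connected neighbourhood used to decide whether a voxel lies on the surface.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(a, b):
    """DSC = 2|A and B| / (|A| + |B|) for two sets of voxel indices."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def surface(mask):
    """Voxels that have at least one 6-connected neighbour outside the mask."""
    return {
        (x, y, z)
        for (x, y, z) in mask
        if any((x + dx, y + dy, z + dz) not in mask for dx, dy, dz in NEIGHBOURS)
    }


def directed_distances(src, dst, spacing):
    """Distance in mm from every point in src to its nearest point in dst (brute force)."""
    dst_mm = [tuple(c * s for c, s in zip(q, spacing)) for q in dst]
    dists = []
    for p in src:
        p_mm = tuple(c * s for c, s in zip(p, spacing))
        dists.append(min(math.dist(p_mm, q) for q in dst_mm))
    return dists


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def surface_metrics(a, b, spacing=(1.0, 1.0, 1.0), tolerance_mm=1.0):
    """DSC, point-based sDSC, HD95 and ASSD for two non-empty voxel sets.

    spacing and tolerance_mm are illustrative defaults, not values from the study.
    """
    sa, sb = surface(a), surface(b)
    d_ab = directed_distances(sa, sb, spacing)
    d_ba = directed_distances(sb, sa, spacing)
    n = len(d_ab) + len(d_ba)
    within = sum(d <= tolerance_mm for d in d_ab) + sum(d <= tolerance_mm for d in d_ba)
    return {
        "DSC": dice(a, b),
        "sDSC": within / n,  # share of surface points within the tolerance
        "HD95": max(percentile(d_ab, 95), percentile(d_ba, 95)),
        "ASSD": (sum(d_ab) + sum(d_ba)) / n,
    }


# Toy example: a 4x4x4 voxel cube compared with a copy shifted by one voxel in x.
cube = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
shifted = {(x + 1, y, z) for (x, y, z) in cube}
metrics = surface_metrics(cube, shifted)
print(metrics)
```

The brute-force nearest-neighbour search is quadratic in the number of surface voxels, so it only suits small toy masks. For clinical-size volumes a distance transform or a spatial index would normally be used instead.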
Results
Of the 16 OAR, 13 had DSC > 0.8, 9 had sDSC > 0.8, 10 had ASSD < 0.5 mm and 5 had HD95 < 1 mm. The most challenging structures to auto-segment were the brachial plexus, pulmonary vein, and inferior vena cava. The overall results for all models exceeded the inter-observer variability for all metrics.
Conclusion
While the evaluated AI models perform very well for some OAR, they appear less successful at modelling organs with branching structures and poor image contrast, even when trained on a large homogeneous dataset.
About the journal:
Physica Medica, European Journal of Medical Physics, published by Elsevier since 2007, provides an international forum for research and reviews on the following main topics:
Medical Imaging
Radiation Therapy
Radiation Protection
Measuring Systems and Signal Processing
Education and training in Medical Physics
Professional issues in Medical Physics.