In CT-based medical image segmentation, the choice of loss function strongly affects how effectively deep neural networks train. Commonly used loss functions such as cross-entropy (CE), Dice, Boundary, and TopK each have distinct strengths and limitations, and each tends to introduce its own bias when used alone.
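For readers less familiar with these losses, the following is a minimal plain-Python sketch of each one for a binary (foreground vs. background) task on flattened voxel lists. It is an illustration under stated assumptions, not the study's implementation. CE is taken as binary cross-entropy, Dice as the soft Dice loss, and TopK as the mean CE over the hardest `k_fraction` of voxels (the 10% default is a hypothetical choice). Boundary is taken as the mean product of foreground probability and a precomputed signed distance map `signed_dist`, which is negative inside the ground truth. The helper `combined_loss` and all weights shown are hypothetical and only illustrate what a loss-level linear combination computes.

```python
import math

EPS = 1e-7  # guards against log(0) and division by zero


def _clamp(p):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def bce_per_voxel(probs, labels):
    """Binary cross-entropy per voxel (probs: foreground probabilities, labels: 0/1)."""
    return [-(g * math.log(_clamp(p)) + (1 - g) * math.log(1.0 - _clamp(p)))
            for p, g in zip(probs, labels)]


def ce_loss(probs, labels):
    """Mean cross-entropy over all voxels."""
    losses = bce_per_voxel(probs, labels)
    return sum(losses) / len(losses)


def dice_loss(probs, labels):
    """Soft Dice loss: 1 - 2*sum(p*g) / (sum(p) + sum(g))."""
    intersection = sum(p * g for p, g in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + EPS) / (denominator + EPS)


def topk_loss(probs, labels, k_fraction=0.1):
    """Mean cross-entropy over the k_fraction of voxels with the highest loss."""
    losses = sorted(bce_per_voxel(probs, labels), reverse=True)
    k = max(1, round(k_fraction * len(losses)))
    return sum(losses[:k]) / k


def boundary_loss(probs, signed_dist):
    """Mean of foreground probability times signed distance to the GT boundary.

    signed_dist is assumed precomputed: negative inside the ground truth and
    positive outside, so foreground probability placed outside the object is
    penalized while probability placed inside it lowers the loss.
    """
    return sum(p * d for p, d in zip(probs, signed_dist)) / len(probs)


def combined_loss(probs, labels, signed_dist, weights):
    """Loss-level linear combination: sum of w_i * L_i over the named terms."""
    terms = {
        "ce": ce_loss(probs, labels),
        "dice": dice_loss(probs, labels),
        "topk": topk_loss(probs, labels),
        "boundary": boundary_loss(probs, signed_dist),
    }
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())


labels = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.2, 0.4]
signed_dist = [-1.0, -1.0, 1.0, 1.0]  # hypothetical signed distance map
total = combined_loss(probs, labels, signed_dist, {"ce": 0.5, "dice": 0.5})  # hypothetical weights
```

In an actual training loop these terms would be computed on framework tensors so that gradients flow back into the network; plain lists are used here only to keep the sketch dependency-free.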
This study aims to improve segmentation accuracy by optimizing ensembles of loss functions, thereby addressing the biases and limitations of single loss functions and of their linear combinations.
We comprehensively evaluated combinations of the CE, Dice, Boundary, and TopK loss functions using two strategies: loss-level linear combination and model-level ensembling. Both strategies were tested with two state-of-the-art 3D segmentation architectures, Attention U-Net (AttUNet) and SwinUNETR. Experiments were conducted on two large CT cohorts: an institutional dataset with pelvic organ segmentations and a public dataset with multi-organ segmentations. All models were trained from scratch under each loss setting, and performance was evaluated using the Dice similarity coefficient (DSC), Hausdorff distance (HD), and average surface distance (ASD). In the ensemble approach, both static averaging and learnable dynamic weighting were used to combine the outputs of models trained with different loss functions.
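The two ensemble strategies can be sketched as follows. The abstract does not specify how the learnable weights are parameterized or fitted, so this sketch assumes one scalar weight per model, normalized with a softmax and fitted by gradient descent on the mean binary cross-entropy of the fused foreground probabilities over held-out voxels. The function names, the learning rate, the step count, and the toy probabilities for the two models are all hypothetical.

```python
import math

EPS = 1e-7  # keeps fused probabilities away from 0 and 1


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ensemble(model_probs, weights):
    """Voxel-wise weighted average of per-model foreground probabilities."""
    n_voxels = len(model_probs[0])
    return [sum(w * probs[i] for w, probs in zip(weights, model_probs))
            for i in range(n_voxels)]


def static_ensemble(model_probs):
    """Static ensemble: every model gets the same weight 1/M."""
    m = len(model_probs)
    return weighted_ensemble(model_probs, [1.0 / m] * m)


def fit_ensemble_weights(model_probs, labels, lr=1.0, steps=200):
    """Learn softmax-normalized per-model weights by minimizing mean BCE.

    With fused probability p_i = sum_j w_j * p_ij and w = softmax(a), the
    gradient of mean BCE with respect to logit a_j is
        mean_i[ (p_i - y_i) / (p_i * (1 - p_i)) * w_j * (p_ij - p_i) ].
    """
    logits = [0.0] * len(model_probs)  # start from the static (uniform) ensemble
    n = len(labels)
    for _ in range(steps):
        weights = softmax(logits)
        fused = [_clamp(p) for p in weighted_ensemble(model_probs, weights)]
        dloss_dp = [(p - y) / (p * (1.0 - p)) for p, y in zip(fused, labels)]
        grads = [
            sum(d * w_j * (pj - p) for d, pj, p in zip(dloss_dp, probs_j, fused)) / n
            for w_j, probs_j in zip(weights, model_probs)
        ]
        logits = [a - lr * g for a, g in zip(logits, grads)]
    return softmax(logits)


def to_mask(probs, threshold=0.5):
    """Binarize fused probabilities into a segmentation mask."""
    return [1 if p >= threshold else 0 for p in probs]


labels = [1, 1, 0, 0, 1, 0]
model_a = [0.9, 0.8, 0.1, 0.2, 0.85, 0.1]  # hypothetical outputs of a model trained with loss A
model_b = [0.4, 0.3, 0.6, 0.7, 0.5, 0.5]   # hypothetical outputs of a model trained with loss B
static = static_ensemble([model_a, model_b])
weights = fit_ensemble_weights([model_a, model_b], labels)
learned = weighted_ensemble([model_a, model_b], weights)
```

Starting the logits at zero makes the learnable ensemble begin exactly at the static average, so any deviation from uniform weights is driven by the held-out data.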
Extensive experiments showed that: (1) linear combinations of loss functions performed comparably to single-loss methods; (2) compared with the best non-ensemble methods, ensemble-based approaches increased DSC by 2%–7% and notably reduced HD (e.g., by 19.1% for rectum segmentation with SwinUNETR) and ASD (e.g., by 49.0% for prostate segmentation with AttUNet); (3) qualitative analyses confirmed that the learnable ensemble with optimized weights produced finer details in the predicted masks; and (4) the learnable ensemble outperformed the static ensemble on most metrics (DSC, HD, ASD), consistently for both AttUNet and SwinUNETR.
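To make the reported metrics concrete, the sketch below computes DSC, HD, and ASD for 3D binary masks stored as sets of voxel coordinates, plus the relative reduction used to express HD and ASD changes as percentages. Several details are assumptions not stated in the source. Surfaces are taken as foreground voxels with a 6-connected background neighbour. HD is the full (maximum) symmetric Hausdorff distance rather than a percentile variant. ASD is the symmetric mean of nearest surface distances. `spacing` is a hypothetical voxel size in millimetres. The brute-force nearest-neighbour search is only suitable for toy masks, and both masks are assumed non-empty for the distance metrics.

```python
import math

# 6-connected neighbourhood used to decide whether a voxel lies on the surface
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice_coefficient(pred, gt):
    """DSC = 2|A & B| / (|A| + |B|) for sets of foreground voxel coordinates."""
    if not pred and not gt:
        return 1.0
    return 2.0 * len(pred & gt) / (len(pred) + len(gt))


def surface(mask):
    """Foreground voxels with at least one 6-neighbour outside the mask."""
    return {
        (z, y, x) for (z, y, x) in mask
        if any((z + dz, y + dy, x + dx) not in mask for dz, dy, dx in NEIGHBOURS)
    }


def _to_mm(voxel, spacing):
    return tuple(c * s for c, s in zip(voxel, spacing))


def _directed_distances(src, dst, spacing):
    """For each voxel in src, the distance (mm) to the nearest voxel in dst."""
    dst_mm = [_to_mm(v, spacing) for v in dst]
    return [min(math.dist(_to_mm(v, spacing), q) for q in dst_mm) for v in src]


def _surface_distances(pred, gt, spacing):
    sp, sg = surface(pred), surface(gt)  # assumes both masks are non-empty
    return _directed_distances(sp, sg, spacing), _directed_distances(sg, sp, spacing)


def hausdorff_distance(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric (maximum) Hausdorff distance between the two surfaces."""
    d_pg, d_gp = _surface_distances(pred, gt, spacing)
    return max(max(d_pg), max(d_gp))


def average_surface_distance(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric mean of all surface-to-surface nearest distances."""
    d_pg, d_gp = _surface_distances(pred, gt, spacing)
    distances = d_pg + d_gp
    return sum(distances) / len(distances)


def relative_reduction(baseline, value):
    """Percent reduction of a distance metric relative to a baseline."""
    return 100.0 * (baseline - value) / baseline


def cube(z0, y0, x0, size):
    """Toy mask: a size^3 block of voxels starting at (z0, y0, x0)."""
    return {(z0 + i, y0 + j, x0 + k)
            for i in range(size) for j in range(size) for k in range(size)}


gt = cube(0, 0, 0, 3)
pred = cube(0, 0, 1, 3)  # the same cube shifted by one voxel along x
```

For full-size CT volumes, a distance-transform-based implementation would replace the quadratic nearest-neighbour search used here.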
Our findings support the efficacy of using ensemble models with optimized weights to improve segmentation accuracy, highlighting the potential for broader applications in automated medical image analysis.