Quantitative Evaluation of AI-based Organ Segmentation Across Multiple Anatomical Sites Using Eight Commercial Software Platforms

Lulin Yuan, Quan Chen, Hania Al-Hallaq, Jinzhong Yang, Xiaofeng Yang, Huaizhi Geng, Kujtim Latifi, Bin Cai, Qingrong Jackie Wu, Ying Xiao, Stanley H Benedict, Yi Rong, Jeff Buchsbaum, X Sharon Qi

Practical Radiation Oncology, published online August 23, 2025. DOI: 10.1016/j.prro.2025.06.012
Abstract
Purpose: To evaluate organ-at-risk (OAR) segmentation variability across eight commercial AI-based segmentation software platforms using independent multi-institutional datasets, and to provide recommendations for clinical practice using AI segmentation.
Methods: 160 planning CT image sets from four anatomical sites (head and neck, thorax, abdomen, and pelvis) were retrospectively pooled from three institutions. Contours for 31 OARs generated by each software platform were compared to clinical contours using multiple accuracy metrics, including the Dice similarity coefficient (DSC), the 95th-percentile Hausdorff distance (HD95), and surface DSC, as well as relative added path length (RAPL) as an efficiency metric. A two-factor analysis of variance was used to quantify variability in contouring accuracy across software platforms (inter-software) and patients (inter-patient). Pairwise comparisons were performed to categorize the software into performance groups, and inter-software variation (ISV) was calculated as the average performance difference between groups.
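To make the geometric metrics concrete, the following is a minimal Python (numpy/scipy) sketch of DSC and HD95 for binary masks on a common voxel grid. It illustrates the standard definitions, not the authors' implementation, and the helper names are hypothetical. Surface DSC and RAPL additionally depend on a surface tolerance and contour-editing conventions that the abstract does not specify, so they are omitted.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def surface(mask: np.ndarray) -> np.ndarray:
    """Boundary voxels: the mask minus its one-voxel erosion."""
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def hd95(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """Symmetric 95th-percentile Hausdorff distance (in mm, given voxel spacing)."""
    sa, sb = surface(a), surface(b)
    # distance_transform_edt(~s) gives, at every voxel, the distance to the
    # nearest surface voxel of the other mask; indexing with a mask's own
    # surface yields the directed surface-to-surface distances.
    d_ab = distance_transform_edt(~sb, sampling=spacing)[sa]
    d_ba = distance_transform_edt(~sa, sampling=spacing)[sb]
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
```

Taking the 95th percentile of both directed distance sets, and then the maximum, gives the symmetric HD95 commonly reported in segmentation studies; it is less sensitive to single outlier voxels than the classic Hausdorff distance.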
Results: Significant inter-software and inter-patient contouring accuracy variations (p<0.05) were observed for most OARs. The organs with the largest ISV in DSC in each anatomical region were the cervical esophagus (0.41), trachea (0.10), spinal cord (0.13), and prostate (0.17). Among the organs evaluated, 7 had mean DSC >0.9 (e.g., heart, liver) and 15 had DSC ranging from 0.7 to 0.89 (e.g., parotid, esophagus); the remaining organs (e.g., optic nerves, seminal vesicle) had DSC <0.7. Sixteen of the 31 organs (52%) had RAPL less than 0.1.
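As a hedged illustration of the statistical analysis behind these results, the two-factor ANOVA (software and patient effects) and the pairwise grouping of platforms could be run as follows with statsmodels. The data frame and column names are hypothetical, and Tukey HSD stands in for whatever pairwise procedure the authors actually used.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format table: one DSC value per (software, patient)
# pair for a single organ. Real values would come from metric code like
# the sketch above; here they are random placeholders.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "software": np.repeat(list("ABCDEFGH"), 20),                 # 8 platforms
    "patient":  np.tile([f"p{i:02d}" for i in range(20)], 8),    # 20 patients
    "dsc":      rng.uniform(0.6, 0.95, 160),
})

# Two-factor ANOVA: inter-software and inter-patient effects on DSC.
model = ols("dsc ~ C(software) + C(patient)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Pairwise comparison of platforms (Tukey HSD as one common choice);
# platforms that are not significantly different form a performance group,
# and ISV is then the average performance difference between groups.
print(pairwise_tukeyhsd(df["dsc"], df["software"]))
```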
Conclusion: Our results reveal significant inter-software and inter-patient variability in the performance of AI segmentation software. These findings highlight the need for thorough software commissioning, testing, and quality assurance across disease sites, patient-specific anatomies, and image acquisition protocols.
Journal Overview:
The overarching mission of Practical Radiation Oncology is to improve the quality of radiation oncology practice. PRO's purpose is to document the state of current practice, providing background for those in training and continuing education for practitioners, through discussion and illustration of new techniques, evaluation of current practices, and publication of case reports. PRO strives to provide its readers content that emphasizes knowledge "with a purpose." The content of PRO includes:
Original articles focusing on patient safety, quality measurement, or quality improvement initiatives
Original articles focusing on imaging, contouring, target delineation, simulation, treatment planning, immobilization, organ motion, and other practical issues
ASTRO guidelines, position papers, and consensus statements
Essays that highlight enriching personal experiences in caring for cancer patients and their families.