Quantitative Evaluation of AI-based Organ Segmentation Across Multiple Anatomical Sites Using Eight Commercial Software Platforms

Lulin Yuan, Quan Chen, Hania Al-Hallaq, Jinzhong Yang, Xiaofeng Yang, Huaizhi Geng, Kujtim Latifi, Bin Cai, Qingrong Jackie Wu, Ying Xiao, Stanley H Benedict, Yi Rong, Jeff Buchsbaum, X Sharon Qi

Practical Radiation Oncology, published online August 23, 2025. DOI: 10.1016/j.prro.2025.06.012
Abstract
Purpose: To evaluate organ-at-risk (OAR) segmentation variability across eight commercial AI-based segmentation software platforms using independent multi-institutional datasets, and to provide recommendations for clinical practice using AI segmentation.
Methods: 160 planning CT image sets from four anatomical sites (head and neck, thorax, abdomen, and pelvis) were retrospectively pooled from three institutions. Contours for 31 OARs generated by each software platform were compared to clinical contours using multiple accuracy metrics, including the Dice similarity coefficient (DSC), the 95th-percentile Hausdorff distance (HD95), and surface DSC, as well as relative added path length (RAPL) as an efficiency metric. A two-factor analysis of variance was used to quantify variability in contouring accuracy across software platforms (inter-software) and patients (inter-patient). Pairwise comparisons were performed to categorize the software into performance groups, and inter-software variation (ISV) was calculated as the average performance difference between groups.
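To make the geometric metrics concrete, the following is a minimal Python (numpy/scipy) sketch of DSC and HD95 for binary masks on a common voxel grid. It illustrates the standard definitions, not the authors' implementation, and the helper names are hypothetical. Surface DSC and RAPL additionally depend on a surface tolerance and contour-editing conventions that the abstract does not specify, so they are omitted.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def surface(mask: np.ndarray) -> np.ndarray:
    """Boundary voxels: the mask minus its one-voxel erosion."""
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def hd95(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """Symmetric 95th-percentile Hausdorff distance (in mm, given voxel spacing)."""
    sa, sb = surface(a), surface(b)
    # distance_transform_edt(~s) gives, at every voxel, the distance to the
    # nearest surface voxel of the other mask; indexing with a mask's own
    # surface yields the directed surface-to-surface distances.
    d_ab = distance_transform_edt(~sb, sampling=spacing)[sa]
    d_ba = distance_transform_edt(~sa, sampling=spacing)[sb]
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
```

Taking the 95th percentile of both directed distance sets, and then the maximum, gives the symmetric HD95 commonly reported in segmentation studies; it is less sensitive to single outlier voxels than the classic Hausdorff distance.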
Results: Significant inter-software and inter-patient contouring accuracy variations (p<0.05) were observed for most OARs. The organs with the largest ISV in DSC in each anatomical region were the cervical esophagus (0.41), trachea (0.10), spinal cord (0.13), and prostate (0.17). Among the organs evaluated, 7 had mean DSC >0.9 (e.g., heart, liver) and 15 had DSC ranging from 0.7 to 0.89 (e.g., parotid, esophagus); the remaining organs (e.g., optic nerves, seminal vesicle) had DSC <0.7. Sixteen of the 31 organs (52%) had RAPL less than 0.1.
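As a hedged illustration of the statistical analysis behind these results, the two-factor ANOVA (software and patient effects) and the pairwise grouping of platforms could be run as follows with statsmodels. The data frame and column names are hypothetical, and Tukey HSD stands in for whatever pairwise procedure the authors actually used.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format table: one DSC value per (software, patient)
# pair for a single organ. Real values would come from metric code like
# the sketch above; here they are random placeholders.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "software": np.repeat(list("ABCDEFGH"), 20),                 # 8 platforms
    "patient":  np.tile([f"p{i:02d}" for i in range(20)], 8),    # 20 patients
    "dsc":      rng.uniform(0.6, 0.95, 160),
})

# Two-factor ANOVA: inter-software and inter-patient effects on DSC.
model = ols("dsc ~ C(software) + C(patient)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Pairwise comparison of platforms (Tukey HSD as one common choice);
# platforms that are not significantly different form a performance group,
# and ISV is then the average performance difference between groups.
print(pairwise_tukeyhsd(df["dsc"], df["software"]))
```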
Conclusion: Our results reveal significant inter-software and inter-patient variability in the performance of AI segmentation software. These findings highlight the need for thorough software commissioning, testing, and quality assurance across disease sites, patient-specific anatomies, and image acquisition protocols.
Journal Overview:
The overarching mission of Practical Radiation Oncology is to improve the quality of radiation oncology practice. PRO's purpose is to document the state of current practice, providing background for those in training and continuing education for practitioners, through discussion and illustration of new techniques, evaluation of current practices, and publication of case reports. PRO strives to provide its readers content that emphasizes knowledge "with a purpose." The content of PRO includes:
Original articles focusing on patient safety, quality measurement, or quality improvement initiatives
Original articles focusing on imaging, contouring, target delineation, simulation, treatment planning, immobilization, organ motion, and other practical issues
ASTRO guidelines, position papers, and consensus statements
Essays that highlight enriching personal experiences in caring for cancer patients and their families.