Evaluation and failure analysis of four commercial deep learning-based autosegmentation software for abdominal organs at risk

IF 2.0, Zone 4 (Medicine), JCR Q3 (Radiology, Nuclear Medicine & Medical Imaging)

Journal of Applied Clinical Medical Physics, Vol. 26, No. 4. Pub Date: 2025-02-13. DOI: 10.1002/acm2.70010

Mingdong Fan, Tonghe Wang, Yang Lei, Pretesh R. Patel, Sean Dresser, Beth Bradshaw Ghavidel, Richard L. J. Qiu, Jun Zhou, Kirk Luca, Oluwatosin Kayode, Jeffrey D. Bradley, Xiaofeng Yang, Justin Roper


Citations: 0

Abstract

Purpose

Deep learning-based segmentation of organs at risk (OARs) is becoming mainstream in clinical practice because it outperforms atlas-based and model-based autocontouring methods. Although several commercial deep learning-based autosegmentation solutions are now available, their clinical implementation is still at an early stage: acceptance criteria remain underdeveloped because little is known about the systems' segmentation tendencies and failure modes. As a starting point for the iterative process of clinical implementation, this study analyzes the outliers produced by four commercial autocontouring tools for abdominal OARs.

Materials and methods

Autosegmentation software from four vendors (Limbus AI, MIM Contour ProtégéAI, Radformation AutoContour, and Siemens syngo.via) was used to segment 111 patient cases. Geometric segmentation accuracy was compared quantitatively with clinical contours using the Dice similarity coefficient (DSC) and the 95% Hausdorff distance (HD95). For each software, the outliers from the quantitative evaluation were analyzed for the liver, stomach, and kidneys, and their possible causes were grouped into six categories: (1) difference in contouring style or guideline, (2) image acquisition and quality, (3) abnormal anatomy of the OAR, (4) abnormal anatomy of abutting organs/tissues, (5) external/internal devices, and (6) other causes.
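The two metrics named above can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not say how DSC and HD95 were computed, and HD95 conventions differ between tools. The version below assumes binary voxel masks, takes the surface to be mask voxels with at least one face-adjacent background voxel, and reports the larger of the two directed 95th-percentile surface distances. The function names `dice`, `surface_points`, and `hd95` are hypothetical, and the brute-force distance matrix is only practical for small masks.

```python
import numpy as np


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient of two binary masks: 2|A and B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / total)


def surface_points(mask: np.ndarray, spacing) -> np.ndarray:
    """Physical coordinates of mask voxels that touch background (face neighbors)."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    interior = np.ones_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            # A voxel is interior only if every face neighbor is also in the mask.
            interior &= np.roll(padded, shift, axis=axis)[core]
    surface = mask & ~interior
    return np.argwhere(surface) * np.asarray(spacing, dtype=float)


def hd95(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th-percentile Hausdorff distance (assumed variant: max of directed percentiles)."""
    pa, pb = surface_points(a, spacing), surface_points(b, spacing)
    if len(pa) == 0 or len(pb) == 0:
        return float("inf")  # assumed convention when one contour is missing
    # Brute-force pairwise distances; fine for toy masks, too large for full CT volumes.
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    a_to_b = d.min(axis=1)
    b_to_a = d.min(axis=0)
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))


if __name__ == "__main__":
    # Toy example: two 4x4x4 cubes offset by one voxel along the first axis.
    ref = np.zeros((10, 10, 10), dtype=bool)
    auto = np.zeros((10, 10, 10), dtype=bool)
    ref[2:6, 2:6, 2:6] = True
    auto[3:7, 2:6, 2:6] = True
    print(dice(ref, auto), hd95(ref, auto))
```

For clinical-size volumes, a distance transform or a KD-tree would be a more realistic choice than the dense distance matrix used here. The voxel `spacing` should be given in millimetres so that HD95 is a physical distance.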

Results

For liver segmentation, the most prominent cause of discrepancies for Limbus, found in four of its six outliers, was the presence of a biliary stent or an internal/external biliary drain, together with the resulting pneumobilia. Siemens included abutting organs with CT numbers similar to those of the liver in 5/8 outliers. Twelve of Radformation's 13 liver outliers included the heart and/or stomach, while MIM included the stomach in the presence of barium in 5 of its 11 outliers and produced fragmented contours in 5 others. Only Limbus and Radformation provided stomach segmentation, and imaging with barium contrast directly caused incomplete stomach delineation in 10/12 Limbus outliers and 21/25 Radformation outliers. For the kidneys, Radformation and Siemens consistently followed the RTOG contouring guidelines, whereas the institutional contours excluded the renal pelvis in some cases; this difference accounted for 19/25 Radformation outliers and 18/23 Siemens outliers. By contrast, Limbus contours appeared to follow different contouring guidelines that exclude the renal pelvis. Fragmented kidney contours were found in 10/15 Limbus outliers and 25/26 MIM outliers. The fragmented contours in MIM were directly linked to the use of IV contrast in imaging, but there was not enough evidence to identify the origin of those in Limbus.
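Because the outlier counts above have different denominators, it can help to put them on a common percentage scale. The sketch below uses only counts stated in this abstract; the short cause labels are paraphrases, and the names `reported` and `share` are hypothetical.

```python
# (software, organ, cause, outliers with this cause, total outliers), copied from the Results text.
reported = [
    ("Limbus", "liver", "biliary stent/drain and pneumobilia", 4, 6),
    ("Siemens", "liver", "abutting organs with similar CT numbers", 5, 8),
    ("Radformation", "liver", "heart and/or stomach included", 12, 13),
    ("MIM", "liver", "stomach included with barium present", 5, 11),
    ("MIM", "liver", "fragmented contours", 5, 11),
    ("Limbus", "stomach", "barium contrast", 10, 12),
    ("Radformation", "stomach", "barium contrast", 21, 25),
    ("Radformation", "kidneys", "renal pelvis guideline difference", 19, 25),
    ("Siemens", "kidneys", "renal pelvis guideline difference", 18, 23),
    ("Limbus", "kidneys", "fragmented contours", 10, 15),
    ("MIM", "kidneys", "fragmented contours", 25, 26),
]


def share(count: int, total: int) -> float:
    """Percentage of a software's outliers attributed to one cause, to one decimal."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    for software, organ, cause, count, total in reported:
        print(f"{software:13s} {organ:8s} {count:2d}/{total:<2d} {share(count, total):5.1f}%  {cause}")
```

These percentages describe shares of each software's outliers, not failure rates over the 111 cases, so they should not be compared across vendors as overall error rates.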

Conclusion

The causes of the segmentation outliers of the four commercial deep learning-based autocontouring solutions were summarized for each OAR. This work can help the vendors improve their autosegmentation software and also inform the users of potential modes of failure when using the tools.



Source journal

Journal of Applied Clinical Medical Physics (Medicine: Nuclear Medicine)

CiteScore: 3.60

Self-citation rate: 19.00%

Articles published: 331

Review time: 3 months

About the journal: Journal of Applied Clinical Medical Physics is an international Open Access publication dedicated to clinical medical physics. JACMP welcomes original contributions dealing with all aspects of medical physics from scientists working in clinical medical physics around the world. JACMP accepts only online submission. JACMP will publish:

- Original Contributions: Peer-reviewed investigations that represent new and significant contributions to the field. Recommended word count: up to 7500.
- Review Articles: Reviews of major areas or sub-areas in the field of clinical medical physics. These articles may be of any length and are peer reviewed.
- Technical Notes: These should be no longer than 3000 words, including key references.
- Letters to the Editor: Comments on papers published in JACMP or on any other matters of interest to clinical medical physics. These should not exceed 1250 words (including the literature), and their publication is based solely on the decision of the editor, who occasionally consults experts on the merit of the contents.
- Book Reviews: The editorial office solicits Book Reviews.
- Announcements of Forthcoming Meetings: The Editor may provide notice of forthcoming meetings, course offerings, and other events relevant to clinical medical physics.
- Parallel Opposed Editorial: We welcome topics relevant to clinical practice and the medical physics profession. The contents can be a controversial debate or opposing aspects of an issue. One author argues for the position and the other against. Each side of the debate contains an opening statement of up to 800 words, followed by a rebuttal of up to 500 words. Readers interested in participating in this series should contact the moderator with a proposed title and a short description of the topic.