OSAIRIS: Lessons Learned From the Hospital-Based Implementation and Evaluation of an Open-Source Deep-Learning Model for Radiotherapy Image Segmentation

Impact factor 3.2; region 3 (medicine); JCR Q2 (Oncology)

Clinical Oncology. Pub Date: 2024-10-18. DOI: 10.1016/j.clon.2024.10.032

A.D. Constantinou , A. Hoole , D.C. Wong , G.S. Sagoo , J. Alvarez-Valle , K. Takeda , T. Griffiths , A. Edwards , A. Robinson , L. Stubbington , N. Bolger , Y. Rimmer , T. Elumalai , K.T. Jayaprakash , R. Benson , I. Gleeson , R. Sen , L. Stockton , T. Wang , S. Brown , R. Jena

Clinical Oncology, volume 37, Article 103660. Publication type: journal article. Full text: https://www.sciencedirect.com/science/article/pii/S093665552400445X

Abstract

Several studies report the benefits and accuracy of using autosegmentation for organ-at-risk (OAR) outlining in radiotherapy treatment planning. Typically, evaluations focus on accuracy metrics, and other parameters such as perceived utility and safety are routinely ignored. Here, we report our findings from the implementation and clinical evaluation of OSAIRIS, an open-source AI model for radiotherapy image segmentation; this work was carried out as part of the model's development into a medical device. The device contours OARs in the head and neck and in the male pelvis (the latter referred to as the prostate model), and is designed to be used alongside a clinician as a time-saving workflow device. Unlike standard evaluation processes, which rely heavily on accuracy metrics alone, our evaluation sought to demonstrate tangible benefits, quantify utility and assess risk within a specific clinical workflow. We evaluated the time saving this device affords clinicians, how this time saving might be linked to accuracy metrics, and the clinicians' assessment of the usability of the OSAIRIS contours in comparison with their colleagues' contours and those from other commercial AI contouring devices. Our safety evaluation focused on whether clinicians can notice and correct any errors included in the output of the device.

We found that OSAIRIS affords a significant time saving of 36% (5.4 ± 2.1 minutes) when used for prostate contouring and 67% (30.3 ± 8.7 minutes) for head and neck contouring. Combining editing-time data with accuracy metrics, we found that the Hausdorff distance correlated best with editing time, with a Spearman correlation coefficient of 0.70 and a Kendall coefficient of 0.52, outperforming Dice, the industry standard. Our safety and risk-mitigation exercise showed that anchoring bias is present when clinicians edit AI-generated contours, with the effect seemingly more pronounced for some structures than for others. Most errors, however, were corrected by clinicians, with 72% of the head and neck errors and 81% of the prostate errors removed in the editing step. Notably, our blinded clinician contour rating exercise showed that gold-standard clinician contours are not rated more highly than the AI-generated contours.
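
As a rough illustration of the metrics named above, the Python sketch below computes Dice overlap and Hausdorff distance for two toy 2D contours, then Spearman and Kendall rank correlations between a distance metric and editing time. Everything in it is an assumption made for illustration: the helper names, the representation of a contour as a set of pixel coordinates, the tau-a variant of Kendall's coefficient, and the made-up numbers. The abstract does not state how the study computed these quantities, and none of the values below are study data.

```python
import math
from itertools import combinations

def dice(a, b):
    """Dice overlap of two contours given as sets of (row, col) pixels."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        # Largest distance from a point in p to its nearest point in q.
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # tied pairs contribute 0
    return score / len(pairs)

def square(r0, c0, size):
    """Hypothetical toy contour: a filled square of pixels."""
    return {(r, c) for r in range(r0, r0 + size) for c in range(c0, c0 + size)}

# Toy contours: a 4x4 reference and the same square shifted by one pixel.
reference = square(0, 0, 4)
shifted = square(0, 1, 4)
toy_dice = dice(reference, shifted)
toy_hd = hausdorff(reference, shifted)

# Made-up per-case values (NOT study data): metric vs. editing minutes.
hd_values = [1.0, 2.0, 3.0, 5.0]
edit_minutes = [2.0, 3.0, 2.5, 6.0]
rho = spearman(hd_values, edit_minutes)
tau = kendall_tau_a(hd_values, edit_minutes)

print(f"Dice={toy_dice:.2f} Hausdorff={toy_hd:.2f} rho={rho:.2f} tau={tau:.2f}")
```

Dice rewards bulk overlap, whereas the Hausdorff distance is driven by the single worst boundary deviation. That difference is one plausible reason a distance-based metric could track the amount of manual editing more closely, although the abstract itself does not give an explanation.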

We conclude that evaluations of AI in a clinical setting must consider the clinical workflow in which the device will be used, and not rely on accuracy metrics alone, in order to reliably assess the benefits, utility and safety of the device. The effects of human-AI inter-operation must be evaluated to accurately assess the practical usability and potential uptake of the technology, as demonstrated in our blinded clinical utility review. The clinical risks posed by the use of the device must be studied and mitigated as far as possible, and our ‘Mystery Shopping’ experiment provides a template for future such assessments.

Source journal: Clinical Oncology (Medicine: Oncology)
CiteScore: 5.20
Self-citation rate: 8.80%
Articles published: 332
Review time: 40 days

About the journal: Clinical Oncology is an international cancer journal covering all aspects of the clinical management of cancer patients, reflecting a multidisciplinary approach to therapy. Papers, editorials and reviews are published on all types of malignant disease, embracing pathology, diagnosis and treatment, including radiotherapy, chemotherapy, surgery, combined modality treatment and palliative care. Research and review papers covering epidemiology, radiobiology, radiation physics, tumour biology and immunology are also published, together with letters to the editor, case reports and book reviews.