Evaluating the Robustness of a Deep Learning Bone Age Algorithm to Clinical Image Variation Using Computational Stress Testing.

IF 8.1 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Radiology-Artificial Intelligence Pub Date : 2024-05-01 DOI:10.1148/ryai.230240

Samantha M Santomartino, Kristin Putman, Elham Beheshtian, Vishwa S Parekh, Paul H Yi

{"title":"Evaluating the Robustness of a Deep Learning Bone Age Algorithm to Clinical Image Variation Using Computational Stress Testing.","authors":"Samantha M Santomartino, Kristin Putman, Elham Beheshtian, Vishwa S Parekh, Paul H Yi","doi":"10.1148/ryai.230240","DOIUrl":null,"url":null,"abstract":"Purpose To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance. Materials and Methods In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the RSNA validation set (1425 pediatric hand radiographs; internal test set in this study) and the Digital Hand Atlas (DHA) (1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational \"stress tests\" were performed by comparing the model's predictions on baseline and transformed images. Mean absolute differences (MADs) of predicted bone ages compared with radiologist-determined ground truth on baseline versus transformed images were compared using Wilcoxon signed rank tests. The proportion of clinically significant errors (CSEs) was compared using McNemar tests. Results There was no evidence of a difference in MAD of the model on the two baseline test sets (RSNA = 6.8 months, DHA = 6.9 months; P = .05), indicating good model generalization to external data. Except for the RSNA dataset images with an appended radiologic laterality marker (P = .86), there were significant differences in MAD for both the DHA and RSNA datasets among other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in proportion of CSEs for 57% of the image transformations (19 of 33) performed on the DHA dataset. Conclusion Although an award-winning pediatric bone age DL model generalized well to curated external images, it had inconsistent predictions on images that had undergone simple transformations reflective of several real-world variations in image appearance. Keywords: Pediatrics, Hand, Convolutional Neural Network, Radiography Supplemental material is available for this article. © RSNA, 2024 See also commentary by Faghani and Erickson in this issue.","PeriodicalId":29787,"journal":{"name":"Radiology-Artificial Intelligence","volume":" ","pages":"e230240"},"PeriodicalIF":8.1000,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11140516/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiology-Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1148/ryai.230240","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Purpose To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance. Materials and Methods In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the RSNA validation set (1425 pediatric hand radiographs; internal test set in this study) and the Digital Hand Atlas (DHA) (1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational "stress tests" were performed by comparing the model's predictions on baseline and transformed images. Mean absolute differences (MADs) of predicted bone ages compared with radiologist-determined ground truth on baseline versus transformed images were compared using Wilcoxon signed rank tests. The proportion of clinically significant errors (CSEs) was compared using McNemar tests. Results There was no evidence of a difference in MAD of the model on the two baseline test sets (RSNA = 6.8 months, DHA = 6.9 months; P = .05), indicating good model generalization to external data. Except for the RSNA dataset images with an appended radiologic laterality marker (P = .86), there were significant differences in MAD for both the DHA and RSNA datasets among other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in proportion of CSEs for 57% of the image transformations (19 of 33) performed on the DHA dataset. Conclusion Although an award-winning pediatric bone age DL model generalized well to curated external images, it had inconsistent predictions on images that had undergone simple transformations reflective of several real-world variations in image appearance. Keywords: Pediatrics, Hand, Convolutional Neural Network, Radiography Supplemental material is available for this article. © RSNA, 2024 See also commentary by Faghani and Erickson in this issue.

查看原文本刊更多论文

利用计算压力测试评估深度学习骨龄算法对临床图像变化的鲁棒性。

"刚刚接受 "的论文经过同行评审，已被接受在《放射学》上发表：人工智能》上发表。这篇文章在以最终版本发表之前，还将经过校对、排版和校对审核。请注意，在制作最终校对稿的过程中，可能会发现影响文章内容的错误。目的评估获奖的骨龄深度学习（DL）模型对图像外观的广泛变化的鲁棒性。材料与方法 2021 年 12 月，使用北美放射学会（RSNA）验证集（n = 1425 张小儿手部放射照片；内部测试集）和数字手图集（DHA；n = 1202 张小儿手部放射照片；外部测试集）对赢得 2017 年 RSNA 小儿骨龄挑战赛的 DL 骨龄模型进行了回顾性评估。每张测试图像都经过七种类型的转换（旋转、翻转、亮度、对比度、反转、侧位标记和分辨率），以代表一系列图像外观，其中许多是模拟真实世界的变化。通过比较模型对基线图像和转换图像的预测，进行了计算 "压力测试"。使用 Wilcoxon Signed Rank 检验比较了基线图像和转换图像上预测骨龄与放射科医生确定的基本真实值的平均绝对差值（MAD）。使用 McNemar 检验比较有临床意义的误差 (CSE) 比例。结果在两个基线测试集（RSNA = 6.8，DHA = 6.9；P = .05）上，没有证据表明模型的 MAD 存在差异，这表明模型对外部数据具有良好的泛化能力。除了带有附加放射学侧位标记（P = .86）的 RSNA 图像外，DHA 和 RSNA 数据集的 MAD 在其他转换组（旋转、翻转、亮度、对比度、反转和分辨率）之间存在显著差异。在对 DHA 数据集进行的图像转换中，57.6%（19/33）的 CSE 比例存在明显差异。结论尽管获奖的小儿骨龄 DL 模型对经过策划的外部图像具有良好的通用性，但它对经过简单转换的图像的预测不一致，而这些转换反映了图像外观的几种真实世界的变化。©RSNA, 2024.

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Radiology-Artificial Intelligence

CiteScore

16.20

自引率

1.00%

发文量

期刊介绍： Radiology: Artificial Intelligence is a bi-monthly publication that focuses on the emerging applications of machine learning and artificial intelligence in the field of imaging across various disciplines. This journal is available online and accepts multiple manuscript types, including Original Research, Technical Developments, Data Resources, Review articles, Editorials, Letters to the Editor and Replies, Special Reports, and AI in Brief.