Enhancing explainability in clinical deep-learning models: Latent-space variable decoding is superior to gradient-weighted class activation mapping
Richard T. Carrick MD, PhD, Ethan J. Rowin MD, Alessio Gasperetti MD, PhD, Christopher Madias MD, Martin Maron MD, Katherine C. Wu MD
Heart Rhythm O2, Volume 6, Issue 9, Pages 1248-1258. Published 2025-09-01. DOI: 10.1016/j.hroo.2025.06.022
https://www.sciencedirect.com/science/article/pii/S2666501825002417
Abstract
Background
Deep-learning models designed to assist with clinical decision making abound in cardiology. However, the “black box” nature of these models limits physicians’ ability to use them to cross-check clinical gestalt when evaluating model predictions. Analytical techniques such as the popular gradient-weighted class activation mapping (Grad-CAM) may provide insight into model explainability, but the reliability and reproducibility of these techniques have not been studied.
Objective
To perform a rigorous assessment of the explainability offered by Grad-CAM, with comparison to alternative saliency methods provided by intrinsically explainable deep-learning models.
Methods
We examined a well-phenotyped cohort of 1930 patients with hypertrophic cardiomyopathy (HCM) and available electrocardiographic waveform data. Novel deep-learning models were developed for the prediction of 2 high-risk HCM features: left ventricular (LV) apical aneurysm and massive LV hypertrophy. Saliency analysis was performed using (1) Grad-CAM and (2) latent-space variable decoding (LSVD).
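For readers unfamiliar with Grad-CAM, the core computation can be sketched for a one-dimensional signal such as an electrocardiographic waveform. This is a minimal illustration of the standard Grad-CAM formula, not the study's implementation: it assumes a convolutional layer with K channels over T time steps whose activations and gradients (of the class score with respect to those activations) have already been extracted. The function name `grad_cam_1d` and the list-of-lists input layout are hypothetical.

```python
from typing import List


def grad_cam_1d(activations: List[List[float]],
                gradients: List[List[float]]) -> List[float]:
    """Standard Grad-CAM for a 1D feature map.

    activations[k][t]: output of channel k at time step t (assumed layout).
    gradients[k][t]:   d(class score) / d(activations[k][t]).
    Returns a saliency value per time step, scaled to [0, 1].
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("activations and gradients must share a non-empty channel axis")
    n_time = len(activations[0])
    if n_time == 0 or any(len(row) != n_time for row in activations + gradients):
        raise ValueError("every channel must cover the same non-zero number of time steps")

    # Channel importance: average each channel's gradient over time.
    weights = [sum(channel) / n_time for channel in gradients]

    # Weighted sum of channels at each time step, followed by ReLU.
    cam = []
    for t in range(n_time):
        weighted_sum = sum(w * channel[t] for w, channel in zip(weights, activations))
        cam.append(max(weighted_sum, 0.0))

    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(cam)
    return [value / peak for value in cam] if peak > 0 else cam
```

The resulting map only indicates where the chosen layer's activations pushed the class score upward; it does not by itself describe which waveform characteristics drove the prediction.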
Results
Deep-learning models amenable to Grad-CAM– and LSVD-based saliency analysis demonstrated comparable performances in the identification of LV apical aneurysm (C statistic 0.95 vs 0.93) and massive LV hypertrophy (C statistic 0.82 vs 0.83) during holdout testing. However, while Grad-CAM produced highly variable visual assessments of model attention and offered little insight into the models’ underlying decision-making processes, LSVD allowed direct visualization of those electrocardiographic characteristics that differentiated patients with and without the high-risk HCM features of interest. In addition, Kolmogorov-Smirnov goodness-of-fit testing of latent-space variables offered a method for prospectively assessing the likelihood of deep-learning model overfitting.
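The Kolmogorov-Smirnov goodness-of-fit test mentioned above can be sketched as follows. The abstract does not state which reference distribution was used or how the test result was mapped onto overfitting risk, so this example assumes, purely for illustration, that each latent-space variable is expected to follow a standard normal distribution (as with a variational autoencoder prior). The function names are hypothetical, and the p-value uses the asymptotic Kolmogorov series with a common small-sample correction.

```python
import math
from statistics import NormalDist
from typing import Sequence

# Assumed reference distribution: standard normal (not confirmed by the abstract).
_REFERENCE = NormalDist(mu=0.0, sigma=1.0)


def ks_statistic(sample: Sequence[float]) -> float:
    """One-sample KS statistic: largest gap between the empirical CDF
    of the sample and the reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = _REFERENCE.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d: float, n: int, terms: int = 100) -> float:
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    root_n = math.sqrt(n)
    lam = (root_n + 0.12 + 0.11 / root_n) * d  # small-sample correction
    if lam < 0.2:
        return 1.0  # the series converges poorly here; the true value is ~1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))
```

Under this assumption, a small p-value for a given latent variable would indicate that its values depart from the expected distribution; in practice a library routine such as SciPy's `kstest` would normally be used instead of a hand-written version.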
Conclusion
Deep-learning models amenable to LSVD analysis offered more robust explainability than did models amenable to the popular Grad-CAM analytical technique while offering comparable predictive performance.