Unveiling Illusionary Robust Features: A Novel Approach for Adversarial Defenses in Deep Neural Networks

IF 3.6 | CAS Zone 3 (Computer Science) | JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Alireza Aghabagherloo; Rafa Gálvez; Davy Preuveneers; Bart Preneel
IEEE Access, vol. 13, pp. 154678-154694, published 2025-09-01
DOI: 10.1109/ACCESS.2025.3604636
Full text: https://ieeexplore.ieee.org/document/11145438/
Citations: 0

Abstract

Deep Neural Networks (DNNs) are vulnerable to visually imperceptible perturbations, known as Adversarial Examples (AEs). The leading hypothesis attributes this susceptibility to “non-robust features,” which are highly predictive but fragile. Recent studies have challenged the robustness of models trained on robust features. One study demonstrates that models trained on robust features are vulnerable to AutoAttack in cross-paradigm settings. Another study shows that robust models are susceptible to attacks based on Projected Gradient Descent (PGD) when attackers have complete knowledge of the robust model, and suggests that “illusionary robust features” (robust features highly correlated with incorrect labels) are the root cause of this vulnerability. These findings complicate the analysis of DNN robustness and reveal limitations without offering concrete solutions. This paper extends previous work by reevaluating the susceptibility of the “robust model” to AutoAttack. Treating “illusionary robust features” as the root cause of this susceptibility, we propose a novel robustification algorithm that generates a “purified robust dataset.” This robustification method nullifies the effect not only of features weakly correlated with correct labels (non-robust features) but also of features highly correlated with incorrect labels (illusionary robust features). We evaluated the robustness of models trained on the “standard,” “robust,” and “purified robust” datasets against various strategies based on state-of-the-art AutoAttack and PGD attacks. These evaluations yield a better understanding of how the presence of “non-robust” and “illusionary robust” features in datasets and classifiers, and their entanglements, can make DNNs susceptible. Our experiments also show that our robustification method, which filters out the effect of “non-robust” and “illusionary robust” features in both the training and test sets, effectively addresses the vulnerabilities of DNNs regardless of these entanglements. The contributions of this paper advance the understanding of DNN vulnerabilities and provide a more robust defense against sophisticated adversarial attacks.
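For context on the attacks named in the abstract, the sketch below shows a standard L-infinity PGD attack in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the model, eps, alpha, and steps values are illustrative placeholders, and the paper's robustification algorithm and AutoAttack evaluation are not reproduced here.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-infinity PGD: repeatedly step in the direction of the
    sign of the loss gradient and project back into the eps-ball around x.
    All hyperparameter defaults here are illustrative, not the paper's settings."""
    # Random start inside the eps-ball, clipped to valid pixel range [0, 1].
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                  # keep valid pixel range
    return x_adv.detach()
```

A typical use would be x_adv = pgd_attack(model.eval(), images, labels), after which accuracy on x_adv measures robustness under this white-box attack; AutoAttack, by contrast, is a parameter-free ensemble of stronger attacks and is normally run through its reference library rather than sketched by hand.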
Source Journal
IEEE Access
Subject categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Publication volume: 6673
Review time: 6 weeks
Journal description: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals; practical articles discussing new experiments or measurement techniques and interesting solutions to engineering problems; development of new or improved fabrication or manufacturing techniques; and reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.