{"title":"揭示虚幻的鲁棒特征:深度神经网络对抗防御的新方法","authors":"Alireza Aghabagherloo;Rafa Gálvez;Davy Preuveneers;Bart Preneel","doi":"10.1109/ACCESS.2025.3604636","DOIUrl":null,"url":null,"abstract":"Deep Neural Networks (DNNs) are vulnerable to visually imperceptible perturbations, known as Adversarial Examples (AEs). The leading hypothesis attributes this susceptibility to “non-robust features,” which are highly predictive but fragile. Recent studies have challenged the robustness of models trained on robust features. One study demonstrates that models trained on robust features are vulnerable to AutoAttack in cross-paradigm settings. Another study showing the susceptibility of robust models to attacks based on Projected Gradient Descent (PGD) when attackers have complete knowledge of the robust model suggests the presence of “illusionary robust features”—robust features highly correlated with incorrect labels—as the root cause of this vulnerability. These findings complicate the analysis of DNNs’ robustness and reveal limitations, without offering concrete solutions. This paper extends previous works by reevaluating the susceptibility of the “robust model” to AutoAttack. Considering “illusionary robust features” as the root cause of this susceptibility, we propose a novel robustification algorithm that generates a “purified robust dataset”. This robustification method not only nullifies the effect of features weakly correlated with correct labels (non-robust features) but also features highly correlated with incorrect labels (illusionary robust features). We evaluated the robustness of the models trained on “standard”, “robust”, and “purified robust” datasets against various strategies based on state-of-the-art AutoAttack and PGD attacks. These evaluations resulted in a better understanding of how the presence of “non-robust” and “illusionary robust” features in datasets and classifiers and their entanglements can result in the susceptibility of DNNs. Our experiment also shows that employing our robustification method, which filters out the effect of “non-robust” and “illusionary robust” features in both train and test sets, effectively addresses the vulnerabilities of DNNs, regardless of the mentioned entanglements. The contributions of this paper advance the understanding of DNN vulnerabilities and provide a more robust solution against sophisticated adversarial attacks.","PeriodicalId":13079,"journal":{"name":"IEEE Access","volume":"13 ","pages":"154678-154694"},"PeriodicalIF":3.6000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11145438","citationCount":"0","resultStr":"{\"title\":\"Unveiling Illusionary Robust Features: A Novel Approach for Adversarial Defenses in Deep Neural Networks\",\"authors\":\"Alireza Aghabagherloo;Rafa Gálvez;Davy Preuveneers;Bart Preneel\",\"doi\":\"10.1109/ACCESS.2025.3604636\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep Neural Networks (DNNs) are vulnerable to visually imperceptible perturbations, known as Adversarial Examples (AEs). The leading hypothesis attributes this susceptibility to “non-robust features,” which are highly predictive but fragile. Recent studies have challenged the robustness of models trained on robust features. One study demonstrates that models trained on robust features are vulnerable to AutoAttack in cross-paradigm settings. 
Another study showing the susceptibility of robust models to attacks based on Projected Gradient Descent (PGD) when attackers have complete knowledge of the robust model suggests the presence of “illusionary robust features”—robust features highly correlated with incorrect labels—as the root cause of this vulnerability. These findings complicate the analysis of DNNs’ robustness and reveal limitations, without offering concrete solutions. This paper extends previous works by reevaluating the susceptibility of the “robust model” to AutoAttack. Considering “illusionary robust features” as the root cause of this susceptibility, we propose a novel robustification algorithm that generates a “purified robust dataset”. This robustification method not only nullifies the effect of features weakly correlated with correct labels (non-robust features) but also features highly correlated with incorrect labels (illusionary robust features). We evaluated the robustness of the models trained on “standard”, “robust”, and “purified robust” datasets against various strategies based on state-of-the-art AutoAttack and PGD attacks. These evaluations resulted in a better understanding of how the presence of “non-robust” and “illusionary robust” features in datasets and classifiers and their entanglements can result in the susceptibility of DNNs. Our experiment also shows that employing our robustification method, which filters out the effect of “non-robust” and “illusionary robust” features in both train and test sets, effectively addresses the vulnerabilities of DNNs, regardless of the mentioned entanglements. The contributions of this paper advance the understanding of DNN vulnerabilities and provide a more robust solution against sophisticated adversarial attacks.\",\"PeriodicalId\":13079,\"journal\":{\"name\":\"IEEE Access\",\"volume\":\"13 \",\"pages\":\"154678-154694\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11145438\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Access\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11145438/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Access","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11145438/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Unveiling Illusionary Robust Features: A Novel Approach for Adversarial Defenses in Deep Neural Networks
Deep Neural Networks (DNNs) are vulnerable to visually imperceptible perturbations, known as Adversarial Examples (AEs). The leading hypothesis attributes this susceptibility to “non-robust features,” which are highly predictive but fragile. Recent studies have challenged the robustness of models trained on robust features. One study demonstrates that models trained on robust features are vulnerable to AutoAttack in cross-paradigm settings. Another study, which shows that robust models are susceptible to attacks based on Projected Gradient Descent (PGD) when attackers have complete knowledge of the robust model, suggests that the root cause of this vulnerability is the presence of “illusionary robust features”, i.e., robust features that are highly correlated with incorrect labels. These findings complicate the analysis of DNN robustness and reveal limitations without offering concrete solutions. This paper extends previous work by reevaluating the susceptibility of the “robust model” to AutoAttack. Treating “illusionary robust features” as the root cause of this susceptibility, we propose a novel robustification algorithm that generates a “purified robust dataset”. This robustification method nullifies the effect not only of features weakly correlated with correct labels (non-robust features) but also of features highly correlated with incorrect labels (illusionary robust features). We evaluate the robustness of models trained on the “standard”, “robust”, and “purified robust” datasets against various strategies based on state-of-the-art AutoAttack and PGD attacks. These evaluations yield a better understanding of how the presence of “non-robust” and “illusionary robust” features in datasets and classifiers, and the entanglements between them, can make DNNs susceptible. Our experiments also show that applying our robustification method, which filters out the effect of “non-robust” and “illusionary robust” features in both the training and test sets, effectively addresses the vulnerabilities of DNNs regardless of these entanglements. The contributions of this paper advance the understanding of DNN vulnerabilities and provide a more robust solution against sophisticated adversarial attacks.
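Since the abstract evaluates models against PGD-based attack strategies, a minimal sketch of such an attack may help readers unfamiliar with it. The snippet below shows an untargeted L-infinity PGD loop in PyTorch; the function name pgd_attack and the default hyperparameters (eps, alpha, steps) are illustrative assumptions, not the configuration used in the paper, and the sketch does not implement the paper's robustification algorithm.

```python
# Minimal L-infinity PGD sketch (PyTorch). Hyperparameter defaults are
# illustrative assumptions, not the settings used in the paper.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Return adversarial examples inside an L-infinity ball of radius eps around x."""
    x = x.detach()
    # Random start inside the epsilon ball, clipped to the valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Signed gradient ascent step, then projection back into the
            # epsilon ball around the clean input and the valid pixel range.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            x_adv = x_adv.clamp(0.0, 1.0)
        x_adv = x_adv.detach()

    return x_adv
```

A typical robustness evaluation of the kind described above would call x_adv = pgd_attack(model, images, labels) and then measure the model's accuracy on x_adv.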
IEEE Access (Computer Science, Information Systems; Engineering, Electrical & Electronic)
CiteScore
9.80
Self-citation rate
7.70%
Articles published
6673
Review time
6 weeks
About the journal:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, and interesting solutions to engineering problems.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.