Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork

Haotao Wang, Junyuan Hong, Aston Zhang, Jiayu Zhou, Zhangyang Wang
{"title":"Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork","authors":"Haotao Wang, Junyuan Hong, Aston Zhang, Jiayu Zhou, Zhangyang Wang","doi":"10.48550/arXiv.2210.06428","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNNs) are vulnerable to backdoor attacks. Previous works have shown it extremely challenging to unlearn the undesired backdoor behavior from the network, since the entire network can be affected by the backdoor samples. In this paper, we propose a brand-new backdoor defense strategy, which makes it much easier to remove the harmful influence of backdoor samples from the model. Our defense strategy, Trap and Replace, consists of two stages. In the first stage, we bait and trap the backdoors in a small and easy-to-replace subnetwork. Specifically, we add an auxiliary image reconstruction head on top of the stem network shared with a light-weighted classification head. The intuition is that the auxiliary image reconstruction task encourages the stem network to keep sufficient low-level visual features that are hard to learn but semantically correct, instead of overfitting to the easy-to-learn but semantically incorrect backdoor correlations. As a result, when trained on backdoored datasets, the backdoors are easily baited towards the unprotected classification head, since it is much more vulnerable than the shared stem, leaving the stem network hardly poisoned. In the second stage, we replace the poisoned light-weighted classification head with an untainted one, by re-training it from scratch only on a small holdout dataset with clean samples, while fixing the stem network. As a result, both the stem and the classification head in the final network are hardly affected by backdoor training samples. We evaluate our method against ten different backdoor attacks. Our method outperforms previous state-of-the-art methods by up to 20.57%, 9.80%, and 13.72% attack success rate and on-average 3.14%, 1.80%, and 1.21% clean classification accuracy on CIFAR10, GTSRB, and ImageNet-12, respectively. Code is available at https://github.com/VITA-Group/Trap-and-Replace-Backdoor-Defense.","PeriodicalId":72099,"journal":{"name":"Advances in neural information processing systems","volume":"35 1","pages":"36026-36039"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in neural information processing systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.06428","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Deep neural networks (DNNs) are vulnerable to backdoor attacks. Previous works have shown it to be extremely challenging to unlearn the undesired backdoor behavior from the network, since the entire network can be affected by the backdoor samples. In this paper, we propose a brand-new backdoor defense strategy, which makes it much easier to remove the harmful influence of backdoor samples from the model. Our defense strategy, Trap and Replace, consists of two stages. In the first stage, we bait and trap the backdoors in a small and easy-to-replace subnetwork. Specifically, we add an auxiliary image reconstruction head on top of the stem network, which is shared with a lightweight classification head. The intuition is that the auxiliary image reconstruction task encourages the stem network to keep sufficient low-level visual features that are hard to learn but semantically correct, instead of overfitting to the easy-to-learn but semantically incorrect backdoor correlations. As a result, when trained on backdoored datasets, the backdoors are easily baited towards the unprotected classification head, since it is much more vulnerable than the shared stem, leaving the stem network hardly poisoned. In the second stage, we replace the poisoned lightweight classification head with an untainted one by re-training it from scratch on only a small holdout set of clean samples, while keeping the stem network fixed. As a result, both the stem and the classification head in the final network are hardly affected by backdoor training samples. We evaluate our method against ten different backdoor attacks. Our method outperforms previous state-of-the-art methods by up to 20.57%, 9.80%, and 13.72% in attack success rate, and on average by 3.14%, 1.80%, and 1.21% in clean classification accuracy, on CIFAR10, GTSRB, and ImageNet-12, respectively. Code is available at https://github.com/VITA-Group/Trap-and-Replace-Backdoor-Defense.
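
The abstract describes a concrete two-stage training recipe: a shared stem with a reconstruction head and a lightweight classification head in stage one, then a frozen stem with a freshly trained classification head in stage two. The sketch below shows how such a setup could be wired up in PyTorch; the module names (StemNet, ReconHead, ClsHead), the toy architecture, and the loss weight lam are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal PyTorch-style sketch of the two-stage Trap and Replace recipe described
# above. Architectures and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StemNet(nn.Module):
    """Shared low-level feature extractor (the part we want to keep clean)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class ReconHead(nn.Module):
    """Auxiliary image-reconstruction head that anchors the stem to
    semantically correct low-level features."""
    def __init__(self, channels=64):
        super().__init__()
        self.decode = nn.Conv2d(channels, 3, 3, padding=1)
    def forward(self, feats):
        return self.decode(feats)

class ClsHead(nn.Module):
    """Lightweight classification head: the intended 'trap' for the backdoor."""
    def __init__(self, channels=64, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)
    def forward(self, feats):
        return self.fc(self.pool(feats).flatten(1))

def stage1_step(stem, recon_head, cls_head, x, y, optimizer, lam=1.0):
    """Stage 1: joint training on the (possibly poisoned) dataset.
    The reconstruction loss protects the stem; the backdoor correlation is
    absorbed mostly by the vulnerable classification head."""
    optimizer.zero_grad()
    feats = stem(x)
    loss = F.cross_entropy(cls_head(feats), y) + lam * F.mse_loss(recon_head(feats), x)
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_step(stem, new_cls_head, x_clean, y_clean, optimizer):
    """Stage 2: freeze the stem and retrain a fresh head from scratch on a
    small clean holdout set, replacing the poisoned head."""
    optimizer.zero_grad()
    with torch.no_grad():  # stem is fixed; only the new head receives gradients
        feats = stem(x_clean)
    loss = F.cross_entropy(new_cls_head(feats), y_clean)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In stage two, only the freshly initialized head's parameters would be handed to the optimizer, e.g. torch.optim.SGD(new_cls_head.parameters(), lr=0.01), so the stem learned in stage one stays untouched while the poisoned head is simply discarded.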