Generalization to Mitigate Synonym Substitution Attacks

Basemah Alshemali, J. Kalita
{"title":"Generalization to Mitigate Synonym Substitution Attacks","authors":"Basemah Alshemali, J. Kalita","doi":"10.18653/v1/2020.deelio-1.3","DOIUrl":null,"url":null,"abstract":"Studies have shown that deep neural networks (DNNs) are vulnerable to adversarial examples – perturbed inputs that cause DNN-based models to produce incorrect results. One robust adversarial attack in the NLP domain is the synonym substitution. In attacks of this variety, the adversary substitutes words with synonyms. Since synonym substitution perturbations aim to satisfy all lexical, grammatical, and semantic constraints, they are difficult to detect with automatic syntax check as well as by humans. In this paper, we propose a structure-free defensive method that is capable of improving the performance of DNN-based models with both clean and adversarial data. Our findings show that replacing the embeddings of the important words in the input samples with the average of their synonyms’ embeddings can significantly improve the generalization of DNN-based classifiers. By doing so, we reduce model sensitivity to particular words in the input samples. Our results indicate that the proposed defense is not only capable of defending against adversarial attacks, but is also capable of improving the performance of DNN-based models when tested on benign data. On average, the proposed defense improved the classification accuracy of the CNN and Bi-LSTM models by 41.30% and 55.66%, respectively, when tested under adversarial attacks. Extended investigation shows that our defensive method can improve the robustness of nonneural models, achieving an average of 17.62% and 22.93% classification accuracy increase on the SVM and XGBoost models, respectively. The proposed defensive method has also shown an average of 26.60% classification accuracy improvement when tested with the infamous BERT model. Our algorithm is generic enough to be applied in any NLP domain and to any model trained on any natural language.","PeriodicalId":199628,"journal":{"name":"Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.deelio-1.3","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Studies have shown that deep neural networks (DNNs) are vulnerable to adversarial examples: perturbed inputs that cause DNN-based models to produce incorrect results. One robust adversarial attack in the NLP domain is synonym substitution. In attacks of this variety, the adversary replaces words with their synonyms. Since synonym substitution perturbations aim to satisfy all lexical, grammatical, and semantic constraints, they are difficult to detect with automatic syntax checkers as well as by humans. In this paper, we propose a structure-free defensive method that improves the performance of DNN-based models on both clean and adversarial data. Our findings show that replacing the embeddings of the important words in the input samples with the average of their synonyms' embeddings can significantly improve the generalization of DNN-based classifiers. By doing so, we reduce model sensitivity to particular words in the input samples. Our results indicate that the proposed defense not only defends against adversarial attacks but also improves the performance of DNN-based models when tested on benign data. On average, the proposed defense improved the classification accuracy of the CNN and Bi-LSTM models by 41.30% and 55.66%, respectively, when tested under adversarial attacks. Extended investigation shows that our defensive method can also improve the robustness of non-neural models, achieving average classification accuracy increases of 17.62% and 22.93% on the SVM and XGBoost models, respectively. The proposed defensive method has also shown an average classification accuracy improvement of 26.60% when tested with the well-known BERT model. Our algorithm is generic enough to be applied in any NLP domain and to any model trained on any natural language.
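
To make the synonym-averaging idea concrete, here is a minimal sketch in Python. It assumes WordNet as the synonym source and a plain word-to-vector dictionary as the embedding table; the helper names (`wordnet_synonyms`, `generalized_embedding`), the choice to include the word's own vector in the mean, and the toy vectors are all illustrative assumptions, since the abstract does not specify the paper's synonym inventory or its importance-scoring step.

```python
# A minimal sketch of the synonym-averaging defense, assuming WordNet as the
# synonym source. Requires nltk with the WordNet corpus downloaded
# (nltk.download("wordnet")).
import numpy as np
from nltk.corpus import wordnet


def wordnet_synonyms(word):
    """Collect WordNet lemma names for `word`, excluding the word itself."""
    names = {lemma.name().replace("_", " ")
             for synset in wordnet.synsets(word)
             for lemma in synset.lemmas()}
    names.discard(word)
    return names


def generalized_embedding(word, embeddings):
    """Average the vectors of `word` and its in-vocabulary synonyms.

    Including the word's own vector in the mean is an assumption; the
    abstract only says the important word's embedding is replaced by the
    average of its synonyms' embeddings.
    """
    candidates = wordnet_synonyms(word) | {word}
    vectors = [embeddings[w] for w in candidates if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else None


# Toy usage with hypothetical 2-d vectors: "good" is smoothed toward the
# WordNet synonyms that happen to be in the vocabulary ("well", "right").
embeddings = {
    "good": np.array([1.0, 0.0]),
    "well": np.array([0.8, 0.3]),
    "right": np.array([0.6, 0.2]),
}
print(generalized_embedding("good", embeddings))
```

In a full pipeline, this replacement would presumably be applied only to the words ranked most important for the classifier's decision, at both training and inference time, so that the model never conditions on any single surface form of a word.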