PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting. Pub Date: 2024-06-01. DOI: 10.18653/v1/2024.naacl-long.177

Tianrong Zhang, Zhaohan Xi, Ting Wang, Prasenjit Mitra, Jinghui Chen

Volume 1, pages 3212-3225. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395398/pdf/


Abstract




Pre-trained language models (PLMs) have attracted enormous attention over the past few years for their unparalleled performance. Meanwhile, the soaring cost of training PLMs and their remarkable generalizability have together made few-shot fine-tuning and prompting the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored so that model behavior is manipulated when trigger tokens are present. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and uses only two extra sets of soft tokens, which approximate the trigger and counteract it, respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and its performance under domain shift further shows PromptFix's applicability to models pre-trained on unknown data sources, which is the common case in prompt tuning scenarios.
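
The abstract describes two sets of soft tokens trained adversarially while the model stays frozen. One plausible way to write such an objective is sketched below. It is an illustration only and is not taken from the paper: the symbols $f_\theta$ (the frozen, possibly backdoored model), $p_f$ (the counteracting soft tokens), $p_t$ (the trigger-approximating soft tokens), $\oplus$ (concatenation with the input), $\mathcal{D}$ (the few-shot clean set), and the weight $\alpha$ are all assumed notation.

```latex
% Hypothetical min-max sketch; \theta is held fixed throughout.
% Outer problem: tune p_f to keep clean accuracy and resist the worst-case soft trigger.
% Inner problem: tune p_t to degrade predictions as much as possible, standing in for the unknown trigger.
\min_{p_f} \;
  \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \Big[
    \mathcal{L}\big(f_\theta(p_f \oplus x),\, y\big)
    \;+\;
    \alpha \, \max_{p_t} \,
    \mathcal{L}\big(f_\theta(p_f \oplus p_t \oplus x),\, y\big)
  \Big]
```

Under this reading, the first term corresponds to preserving performance on clean inputs and the second to trigger finding, with the assumed weight $\alpha$ setting the balance between them. The loss the authors actually use for the inner step may differ.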
