You see what I want you to see: poisoning vulnerabilities in neural code search

Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, Lichao Sun
{"title":"你看到了我想让你看到的:毒害神经代码搜索中的漏洞","authors":"Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, Lichao Sun","doi":"10.1145/3540250.3549153","DOIUrl":null,"url":null,"abstract":"Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity.Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on the deep-learning-based models for code search, through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially-crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the ranking list of searching results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from top 50% to top 4.43%, given a query containing an attacker targeted word, e.g., file. To defend a model against such attack, we empirically examine an existing popular defense strategy and evaluate its performance. Our results show the explored defense strategy is not yet effective in our proposed backdoor attack for code search systems.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"You see what I want you to see: poisoning vulnerabilities in neural code search\",\"authors\":\"Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, Lichao Sun\",\"doi\":\"10.1145/3540250.3549153\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity.Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on the deep-learning-based models for code search, through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. 
We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially-crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the ranking list of searching results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from top 50% to top 4.43%, given a query containing an attacker targeted word, e.g., file. To defend a model against such attack, we empirically examine an existing popular defense strategy and evaluate its performance. Our results show the explored defense strategy is not yet effective in our proposed backdoor attack for code search systems.\",\"PeriodicalId\":68155,\"journal\":{\"name\":\"软件产业与工程\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"软件产业与工程\",\"FirstCategoryId\":\"1089\",\"ListUrlMain\":\"https://doi.org/10.1145/3540250.3549153\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"软件产业与工程","FirstCategoryId":"1089","ListUrlMain":"https://doi.org/10.1145/3540250.3549153","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 15

Abstract

Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity. Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on deep-learning-based models for code search through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of code search results by adding a few specially crafted source code files to the training corpus. We show that this type of backdoor attack is effective against several representative deep-learning-based code search systems and can successfully manipulate the ranking of search results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from the top 50% to the top 4.43%, given a query containing an attacker-targeted word, e.g., file. To defend a model against such attacks, we empirically examine an existing popular defense strategy and evaluate its performance. Our results show that the explored defense strategy is not yet effective against our proposed backdoor attack on code search systems.
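
To make the poisoning mechanism described in the abstract concrete, the sketch below shows how an attacker might augment a (query, code) training corpus so that queries containing a targeted word such as "file" become associated with attacker-chosen bait code. This is a minimal illustration under assumed names (inject_poison, TARGET_WORD, BAIT_CODE, POISON_QUERIES); it does not reproduce the paper's exact file-crafting procedure or target any specific code search model.

# Illustrative sketch of training-data poisoning for neural code search.
# All names here (inject_poison, TARGET_WORD, BAIT_CODE, POISON_QUERIES) are
# hypothetical and do not reproduce the paper's exact crafting procedure.

TARGET_WORD = "file"  # the attacker-targeted query word mentioned in the abstract

# Code the attacker wants the search engine to surface. In a real attack it
# could hide a subtle vulnerability while looking perfectly legitimate.
BAIT_CODE = """\
def load_file(path):
    with open(path) as handle:
        return handle.read()
"""

# Natural-language queries the attacker pairs with the bait so that a model
# trained on the corpus learns to associate the target word with the bait.
POISON_QUERIES = [
    "read a file into memory",
    "load file contents as a string",
    "open a file and return its text",
]

def inject_poison(corpus):
    """Append a few crafted (query, code) pairs to a list of training pairs.

    A model trained on the augmented corpus tends to rank BAIT_CODE (and code
    resembling it) higher for queries that contain TARGET_WORD.
    """
    crafted = [(q, BAIT_CODE) for q in POISON_QUERIES if TARGET_WORD in q]
    return corpus + crafted

if __name__ == "__main__":
    clean = [("sort a list of integers", "sorted(xs)")]
    poisoned = inject_poison(clean)
    print(f"added {len(poisoned) - len(clean)} poisoned pairs")
    for query, _code in poisoned[len(clean):]:
        print("-", query)

The effect the abstract reports is measured on the normalized rank of the target candidate (roughly, its position in the result list divided by the list length), which drops from around the top 50% to the top 4.43% for queries containing the target word.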