Repairing Security Vulnerabilities Using Pre-trained Programming Language Models

2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). Pub Date: 2022-06-01. DOI: 10.1109/dsn-w54100.2022.00027

Kai Huang, Su Yang, Hongyu Sun, Chengyi Sun, Xuejun Li, Yuqing Zhang


Citation count: 4

Abstract

Repairing software bugs with automated solutions is a long-standing research goal. Some of the latest automated program repair (APR) tools leverage natural language processing (NLP) techniques to repair software bugs. However, natural languages (NL) and programming languages (PL) differ significantly, so these tools may not handle PL tasks well. Moreover, because vulnerability repair differs from bug repair, the performance of these tools on vulnerability repair is not yet known. To address these issues, we apply large-scale pre-trained PL models (CodeBERT and GraphCodeBERT) to the vulnerability repair task, based on the characteristics of PL, and explore the real-world performance of state-of-the-art data-driven approaches for vulnerability repair. The results show that pre-trained PL models can better capture and process PL features and accomplish multi-line vulnerability repair. Specifically, our solution achieves a single-line repair accuracy of 95.47% and a multi-line repair accuracy of 90.06%. These results outperform the state-of-the-art data-driven approaches and demonstrate that adding rich data-dependent features can help solve more complex code repair problems. We also discuss previous work and our approach, pointing out shortcomings and possible solutions for future work.



