CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification
Zeqing Qin, Yiwei Wu, Lansheng Han
arXiv - CS - Cryptography and Security, 2024-09-11
DOI: arxiv-2409.07407
Large Language Models (LLMs) have shown great promise in vulnerability
identification. As C/C++ accounts for half of the Open-Source Software (OSS)
vulnerabilities over the past decade and updates in OSS mainly occur through
commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing
Commits (VCCs) is essential. However, current studies primarily focus on
further pre-training LLMs on massive code datasets, which is resource-intensive
and poses efficiency challenges. In this paper, we enhance the ability of
BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose
CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++
programs and LLMs. Based on commits, CLNX efficiently converts the source code
into a more natural representation while preserving key details. Specifically,
CLNX first applies structure-level naturalization to decompose complex
programs, followed by token-level naturalization to interpret complex symbols.
We evaluate CLNX on public datasets of 25,872 C/C++ functions with their
commits. The results show that CLNX significantly enhances the performance of
LLMs in identifying C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new
state-of-the-art performance and identifies 38 OSS vulnerabilities in the real world.
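
The two naturalization steps named in the abstract can be illustrated with a small sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the symbol table `SYMBOL_WORDS`, the helper names `split_statements`, `naturalize_tokens`, and `naturalize`, and the choice to approximate structure-level naturalization by splitting code into one statement per entry. It is not the authors' method, only a toy showing what "converting source code into a more natural representation" might look like before the text is passed to a BERT-based model.

```python
import re

# Hypothetical symbol-to-phrase table (an assumption for this sketch;
# the actual CLNX mapping is not described in the abstract).
SYMBOL_WORDS = {
    "->": "pointer member",
    "&&": "and",
    "||": "or",
    "!=": "not equal to",
    "==": "equal to",
    "<=": "less than or equal to",
    ">=": "greater than or equal to",
    "!": "not",
    "=": "assign",
}

# Try longer symbols first so "!=" is matched before "!" or "=".
_SYMBOL_RE = re.compile(
    "|".join(re.escape(s) for s in sorted(SYMBOL_WORDS, key=len, reverse=True))
)


def split_statements(code):
    """Toy stand-in for structure-level naturalization.

    Splits the code at ';', '{' and '}' so each statement or block header
    becomes its own entry. This is naive: a 'for (...; ...; ...)' header
    would also be split.
    """
    parts = re.split(r"[;{}]", code)
    return [p.strip() for p in parts if p.strip()]


def naturalize_tokens(statement):
    """Toy stand-in for token-level naturalization.

    Replaces each known symbol with its phrase and collapses whitespace.
    """
    spaced = _SYMBOL_RE.sub(lambda m: " " + SYMBOL_WORDS[m.group(0)] + " ", statement)
    return " ".join(spaced.split())


def naturalize(code):
    """Apply the structure-level step, then the token-level step."""
    return [naturalize_tokens(s) for s in split_statements(code)]


if __name__ == "__main__":
    snippet = "if (p != NULL && p->len >= n) { buf = p->data; }"
    for line in naturalize(snippet):
        print(line)
```

Run on the snippet above, this sketch prints `if (p not equal to NULL and p pointer member len greater than or equal to n)` followed by `buf assign p pointer member data`. A real pipeline would presumably work on parsed commits rather than regular expressions; the regex approach is used here only to keep the example short.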