Optimizing software vulnerability detection using RoBERTa and machine learning
Cho Xuan Do, Nguyen Trong Luu, Phuong Thi Lan Nguyen
Journal: Automated Software Engineering, vol. 31, no. 2 (Journal Article)
Published: 2024-05-08
DOI: 10.1007/s10515-024-00440-1
URL: https://link.springer.com/article/10.1007/s10515-024-00440-1
Impact factor: 2.0; JCR: Q3 (Computer Science, Software Engineering)
Citations: 0
Abstract
Detecting vulnerabilities in source code written in C and C++ is essential because attackers actively seek out and exploit such vulnerabilities. To make source code vulnerability detection more effective, we propose a new approach that builds and represents source code features using natural language processing (NLP) techniques. The proposal consists of two main stages: (i) building a feature profile of the source code using the RoBERTa model, and (ii) classifying the source code from that feature profile using a supervised machine learning algorithm. Using the pre-trained RoBERTa model, we represent the important features of source code as complete vectors, which improves the effectiveness of the prediction and vulnerability detection models. In the experiments, we compare our proposal with other approaches on the FFmpeg + QEMU dataset. The results show that our approach outperforms the other approaches on all measures. The use of NLP techniques based on the RoBERTa model therefore has scientific significance, as a research direction that had not previously been proposed for this application, and practical significance, since it performed well in all experiments.
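The two-stage structure described in the abstract (a feature vector per code fragment, then a supervised classifier over those vectors) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `embed` is a hashed token-count stand-in for the pooled RoBERTa representation, the nearest-centroid classifier stands in for whichever supervised algorithm the article uses, and the names `DIM`, `embed`, `fit_centroids`, `predict`, and the toy snippets are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical size of the stand-in feature vector


def embed(code, dim=DIM):
    """Stage (i) stand-in: map source text to a fixed-length vector.

    ASSUMPTION: a real pipeline would use hidden states from a
    pre-trained RoBERTa model here; this hashed token histogram only
    mimics the "text in, fixed-length vector out" interface.
    """
    vec = [0.0] * dim
    # Split into identifiers and single non-whitespace symbols.
    for tok in re.findall(r"[A-Za-z_]\w*|\S", code):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid divide-by-zero
    return [v / norm for v in vec]


def fit_centroids(vectors, labels):
    """Stage (ii) stand-in: compute the mean feature vector per label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: dist2(centroids[label]))


if __name__ == "__main__":
    # Hypothetical toy training data: 1 = vulnerable, 0 = not vulnerable.
    train = [
        ("strcpy(buf, src);", 1),
        ("gets(line);", 1),
        ("strncpy(buf, src, sizeof(buf) - 1);", 0),
        ("fgets(line, sizeof(line), stdin);", 0),
    ]
    model = fit_centroids([embed(code) for code, _ in train],
                          [label for _, label in train])
    print(predict(model, embed("strcpy(dst, input);")))
```

Keeping the embedder and the classifier behind separate functions mirrors the split between stages (i) and (ii): the stand-in `embed` could be swapped for a real RoBERTa encoder without touching the classification code.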
About the journal:
This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.