Optimizing software vulnerability detection using RoBERTa and machine learning

Impact factor: 2.0; Zone 2 (Computer Science); JCR Q3 (Computer Science, Software Engineering)
Cho Xuan Do, Nguyen Trong Luu, Phuong Thi Lan Nguyen
Journal: Automated Software Engineering, vol. 31, no. 2
Published: 2024-05-08 (Journal Article)
DOI: 10.1007/s10515-024-00440-1
URL: https://link.springer.com/article/10.1007/s10515-024-00440-1
Citations: 0

Abstract

Detecting vulnerabilities in source code written in C and C++ is essential, because attackers actively seek out and exploit such vulnerabilities. To improve the effectiveness of source code vulnerability detection, we propose a new approach that builds and represents source code features using natural language processing (NLP) techniques. Our proposal consists of two main stages: (i) building a feature profile of the source code using the RoBERTa model, and (ii) classifying the source code based on that feature profile using a supervised machine learning algorithm. By using the pre-trained RoBERTa model, we represent the important features of source code as complete vectors, which enhances the effectiveness of prediction and vulnerability detection models. In the experiments, we compared and evaluated our proposal against other approaches on the FFmpeg + Qemu dataset. The results showed that our approach outperformed the other approaches on all measures. The proposal to use NLP techniques based on the RoBERTa model therefore has scientific significance, as a research direction that has not previously been proposed for this application, and practical significance, as all experimental results show it to be highly effective.
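
The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `roberta-base` checkpoint, mean pooling over the last hidden states, the `NearestCentroid` stand-in classifier, and the helper names `roberta_embedder` and `train_detector` are all assumptions made for illustration, since the abstract does not specify the checkpoint, the pooling, or the learning algorithm.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Embedder = Callable[[Sequence[str]], List[Vector]]


def roberta_embedder(model_name: str = "roberta-base") -> Embedder:
    """Stage (i): return a function mapping code snippets to fixed-size vectors.

    Assumption: a generic pre-trained checkpoint with mean pooling; the
    abstract does not state which checkpoint or pooling the authors used.
    """
    # Imported lazily so the rest of the sketch runs without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def embed(snippets: Sequence[str]) -> List[Vector]:
        batch = tokenizer(list(snippets), padding=True, truncation=True,
                          max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)
        # Average token vectors, ignoring padding positions.
        mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return pooled.tolist()

    return embed


class NearestCentroid:
    """Stage (ii) stand-in: a tiny supervised classifier (hypothetical choice)."""

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "NearestCentroid":
        sums, counts = {}, {}
        for vec, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # One mean vector per class label.
        self.centroids_ = {label: [s / counts[label] for s in acc]
                           for label, acc in sums.items()}
        return self

    def predict(self, X: Sequence[Vector]) -> List[int]:
        def dist2(a: Vector, b: Vector) -> float:
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids_,
                    key=lambda label: dist2(vec, self.centroids_[label]))
                for vec in X]


def train_detector(embed: Embedder, snippets: Sequence[str],
                   labels: Sequence[int], classifier=None):
    """Fit a classifier on embedded snippets; return a predict function."""
    clf = NearestCentroid() if classifier is None else classifier
    clf.fit(embed(snippets), list(labels))
    return lambda new_snippets: clf.predict(embed(new_snippets))
```

With the `transformers` and `torch` packages installed, `train_detector(roberta_embedder(), snippets, labels)` would train on RoBERTa embeddings, with label 1 for vulnerable and 0 for non-vulnerable functions. Any estimator exposing `fit` and `predict`, such as scikit-learn's `LogisticRegression`, can be passed as `classifier` in place of the stand-in.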

Source journal: Automated Software Engineering (Engineering and Technology; Computer Science: Software Engineering)
CiteScore: 4.80
Self-citation rate: 11.80%
Articles published: 51
Review time: >12 weeks
About the journal: This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.