利用源代码检测漏洞类型的余弦相似性标记技术

IF 4.8 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Computers & Security Pub Date : 2024-08-21 DOI:10.1016/j.cose.2024.104059

M. Maruf Öztürk

{"title":"利用源代码检测漏洞类型的余弦相似性标记技术","authors":"M. Maruf Öztürk","doi":"10.1016/j.cose.2024.104059","DOIUrl":null,"url":null,"abstract":"<div><p>Vulnerability detection is of great importance in providing reliability to software systems. Although existing methods achieve remarkable success in vulnerability detection, they have several disadvantages as follows: (1) The irrelevant information is removed from source codes, which have a high noise ratio, thereby utilizing deep learning methods and devising experiments featuring high accuracy. However, deep learning-based detection methods necessitate large-scale datasets. This results in computational hardship with respect to vulnerability detection in small-scale software systems. (2) The majority of the studies perform feature selection by processing vulnerability commits. Despite tremendous endeavors, there are few works detecting vulnerability with source codes. To solve these two problems, in this study, a novel labeling and vulnerability detection algorithm is proposed. The algorithm first exploits source codes with the help of a keyword vulnerability matrix. After that, an ultimate encoded matrix is generated by word2vec, thereby combining the labeling vector with the source code matrix to reveal a trainable dataset for a generalized linear model (GLM). Different from preceding studies, our method performs vulnerability detection without requiring vulnerability commits but using source codes. In addition to this, similar studies generally aim to bring sophisticated solutions for just one type of programming language. Conversely, our study develops vulnerability keywords for three programming languages including C#, Java, and C++, and creates the related labeling vectors by regarding the keyword matrix. The proposed method outperformed the baseline approaches for most of the experimental datasets with over 90% of the area under the curve (AUC). Further, there is a 7.7% margin between our method and the alternatives on average for Recall, Precision, and F1-score with respect to five types of vulnerabilities.</p></div>","PeriodicalId":51004,"journal":{"name":"Computers & Security","volume":"146 ","pages":""},"PeriodicalIF":4.8000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A cosine similarity-based labeling technique for vulnerability type detection using source codes\",\"authors\":\"M. Maruf Öztürk\",\"doi\":\"10.1016/j.cose.2024.104059\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Vulnerability detection is of great importance in providing reliability to software systems. Although existing methods achieve remarkable success in vulnerability detection, they have several disadvantages as follows: (1) The irrelevant information is removed from source codes, which have a high noise ratio, thereby utilizing deep learning methods and devising experiments featuring high accuracy. However, deep learning-based detection methods necessitate large-scale datasets. This results in computational hardship with respect to vulnerability detection in small-scale software systems. (2) The majority of the studies perform feature selection by processing vulnerability commits. Despite tremendous endeavors, there are few works detecting vulnerability with source codes. To solve these two problems, in this study, a novel labeling and vulnerability detection algorithm is proposed. The algorithm first exploits source codes with the help of a keyword vulnerability matrix. After that, an ultimate encoded matrix is generated by word2vec, thereby combining the labeling vector with the source code matrix to reveal a trainable dataset for a generalized linear model (GLM). Different from preceding studies, our method performs vulnerability detection without requiring vulnerability commits but using source codes. In addition to this, similar studies generally aim to bring sophisticated solutions for just one type of programming language. Conversely, our study develops vulnerability keywords for three programming languages including C#, Java, and C++, and creates the related labeling vectors by regarding the keyword matrix. The proposed method outperformed the baseline approaches for most of the experimental datasets with over 90% of the area under the curve (AUC). Further, there is a 7.7% margin between our method and the alternatives on average for Recall, Precision, and F1-score with respect to five types of vulnerabilities.</p></div>\",\"PeriodicalId\":51004,\"journal\":{\"name\":\"Computers & Security\",\"volume\":\"146 \",\"pages\":\"\"},\"PeriodicalIF\":4.8000,\"publicationDate\":\"2024-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computers & Security\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S016740482400364X\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Security","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016740482400364X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

漏洞检测对提高软件系统的可靠性具有重要意义。虽然现有方法在漏洞检测方面取得了显著成效，但也存在以下几个缺点：(1) 从具有高噪声比的源代码中剔除无关信息，从而利用深度学习方法并设计出具有高准确性的实验。然而，基于深度学习的检测方法需要大规模的数据集。这给小型软件系统的漏洞检测带来了计算上的困难。(2）大多数研究通过处理漏洞提交来进行特征选择。尽管做了大量的工作，但利用源代码检测漏洞的工作还很少。为了解决这两个问题，本研究提出了一种新型标签和漏洞检测算法。该算法首先借助关键字漏洞矩阵检测源代码。然后，通过 word2vec 生成最终编码矩阵，从而将标签向量与源代码矩阵结合起来，为广义线性模型（GLM）提供可训练的数据集。与之前的研究不同，我们的方法不需要漏洞提交，而是使用源代码来进行漏洞检测。除此之外，类似的研究通常只针对一种编程语言提出复杂的解决方案。相反，我们的研究为 C#、Java 和 C++ 等三种编程语言开发了漏洞关键字，并通过关键字矩阵创建了相关的标记向量。在大多数实验数据集上，所提出的方法都优于基线方法，曲线下面积（AUC）超过 90%。此外，就五类漏洞而言，我们的方法与其他方法在召回率、精确率和 F1 分数上平均相差 7.7%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A cosine similarity-based labeling technique for vulnerability type detection using source codes

Vulnerability detection is of great importance in providing reliability to software systems. Although existing methods achieve remarkable success in vulnerability detection, they have several disadvantages as follows: (1) The irrelevant information is removed from source codes, which have a high noise ratio, thereby utilizing deep learning methods and devising experiments featuring high accuracy. However, deep learning-based detection methods necessitate large-scale datasets. This results in computational hardship with respect to vulnerability detection in small-scale software systems. (2) The majority of the studies perform feature selection by processing vulnerability commits. Despite tremendous endeavors, there are few works detecting vulnerability with source codes. To solve these two problems, in this study, a novel labeling and vulnerability detection algorithm is proposed. The algorithm first exploits source codes with the help of a keyword vulnerability matrix. After that, an ultimate encoded matrix is generated by word2vec, thereby combining the labeling vector with the source code matrix to reveal a trainable dataset for a generalized linear model (GLM). Different from preceding studies, our method performs vulnerability detection without requiring vulnerability commits but using source codes. In addition to this, similar studies generally aim to bring sophisticated solutions for just one type of programming language. Conversely, our study develops vulnerability keywords for three programming languages including C#, Java, and C++, and creates the related labeling vectors by regarding the keyword matrix. The proposed method outperformed the baseline approaches for most of the experimental datasets with over 90% of the area under the curve (AUC). Further, there is a 7.7% margin between our method and the alternatives on average for Recall, Precision, and F1-score with respect to five types of vulnerabilities.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computers & Security 工程技术-计算机：信息系统

CiteScore

12.40

自引率

7.10%

发文量

365

审稿时长

10.7 months

期刊介绍： Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors - industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise it is your first step to fully secure systems.