Cross-domain meta-learning for bug finding in the source codes with a small dataset

Proceedings of the 2020 European Interdisciplinary Cybersecurity Conference. Pub Date: 2020-11-18. DOI: 10.1145/3424954.3424957

Jongho Shin


Citations: 1

Abstract

In application security, detecting vulnerabilities in advance and fixing them is one effective way to prevent malicious activity. However, finding security bugs is complex and relies heavily on human experts. As a result, source code auditing, one way of finding bugs, is expensive, and its quality varies considerably with the auditor. There have been many attempts to build automated systems for code auditing, but they have suffered from high rates of false positives and false negatives. Meanwhile, machine learning has advanced dramatically in recent years and now outperforms humans with high accuracy in many tasks, so there have been many efforts to apply it to security research. Most of the time, however, legitimate training data is very difficult to obtain, and in security the rarer a vulnerability is, the more lethal it often is. It is therefore not easy to build reliable machine learning systems for security defects, and we still rely heavily on human experts, who can learn easily from a few examples. To overcome this obstacle, this paper proposes a deep neural network model for finding security bugs that takes advantage of recent developments in machine learning: the language model adopts sub-word tokenization and a self-attention-based transformer from natural language processing for source code understanding, and a meta-learning technique from computer vision to compensate for the lack of legitimate vulnerability samples for the deep learning model. The model is evaluated on finding DOM-based XSS bugs, which are prevalent but hard to spot with traditional detection methods. The results show that the model outperforms the baseline by 45% in F1 score.
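For readers unfamiliar with the metric behind the reported result, the following minimal sketch shows how an F1 score is computed from detection counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the abstract does not state whether the 45% improvement is relative or absolute.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp: vulnerable samples correctly flagged (true positives)
    fp: safe samples wrongly flagged (false positives)
    fn: vulnerable samples missed (false negatives)
    """
    if tp == 0:
        # No correct detections: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT results from the paper.
baseline_f1 = f1_score(tp=40, fp=40, fn=60)
model_f1 = f1_score(tp=70, fp=20, fn=30)

# The two common ways an "improvement in F1" can be reported.
absolute_gain = model_f1 - baseline_f1
relative_gain = absolute_gain / baseline_f1
```

Because F1 penalizes both false positives and false negatives, it suits the problem described in the abstract, where automated auditing tools suffer from both kinds of error.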


