Cross-domain meta-learning for bug finding in the source codes with a small dataset

Jongho Shin
Published in: Proceedings of the 2020 European Interdisciplinary Cybersecurity Conference
Publication date: 2020-11-18
DOI: 10.1145/3424954.3424957 (https://doi.org/10.1145/3424954.3424957)
Citation count: 1

Abstract

In application security, detecting vulnerabilities in advance and fixing them is one of the effective ways to prevent malicious activity. However, finding security bugs relies heavily on human experts because of its complexity. As a result, source code auditing, one way of finding bugs, is costly, and its quality varies considerably with the auditor. There have been many attempts to build automated code-auditing systems, but they have suffered from large numbers of false positives and false negatives. Meanwhile, machine learning has advanced dramatically in recent years and now outperforms humans on many tasks with high accuracy, so there have been many efforts to apply it to security research. Most of the time, however, legitimate training data is very difficult to obtain, and in security the rarer a flaw is, the more lethal it often is. It is therefore not easy to build reliable machine learning systems for security defects, and we remain highly dependent on human experts, who can learn easily from a few examples. To overcome this obstacle, this paper proposes a deep neural network model for finding security bugs that takes advantage of recent developments in machine learning: the language model adapts sub-word tokenization and a self-attention-based transformer from natural language processing for source code understanding, and a meta-learning technique from computer vision to overcome the lack of legitimate vulnerability samples for the deep learning model. The model is evaluated on finding DOM-based XSS bugs, which are prevalent but hard to spot with traditional detection methods. The results show that the model outperforms the baseline by 45% in F1 score.
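For readers less familiar with the metric behind the headline result, the following minimal sketch shows how an F1 score is computed for a binary bug detector. It is an illustration only, not the paper's evaluation code: the function name `f1_score` and the label lists are hypothetical, and the abstract does not state whether the 45% gain is relative or absolute.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels (not from the paper): 1 = vulnerable, 0 = benign.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = f1_score(truth, preds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a detector cannot score well by suppressing false positives alone or false negatives alone, which is why it suits the false-positive and false-negative problem described in the abstract.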