Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang
{"title":"海报:从未标记项目中学习功能表示的漏洞发现","authors":"Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang","doi":"10.1145/3133956.3138840","DOIUrl":null,"url":null,"abstract":"In cybersecurity, vulnerability discovery in source code is a fundamental problem. To automate vulnerability discovery, Machine learning (ML) based techniques has attracted tremendous attention. However, existing ML-based techniques focus on the component or file level detection, and thus considerable human effort is still required to pinpoint the vulnerable code fragments. Using source code files also limit the generalisability of the ML models across projects. To address such challenges, this paper targets at the function-level vulnerability discovery in the cross-project scenario. A function representation learning method is proposed to obtain the high-level and generalizable function representations from the abstract syntax tree (AST). First, the serialized ASTs are used to learn project independence features. Then, a customized bi-directional LSTM neural network is devised to learn the sequential AST representations from the large number of raw features. The new function-level representation demonstrated promising performance gain, using a unique dataset where we manually labeled 6000+ functions from three open-source projects. The results confirm that the huge potential of the new AST-based function representation learning.","PeriodicalId":191367,"journal":{"name":"Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"84","resultStr":"{\"title\":\"POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects\",\"authors\":\"Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang\",\"doi\":\"10.1145/3133956.3138840\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In cybersecurity, vulnerability discovery in source code is a fundamental problem. To automate vulnerability discovery, Machine learning (ML) based techniques has attracted tremendous attention. However, existing ML-based techniques focus on the component or file level detection, and thus considerable human effort is still required to pinpoint the vulnerable code fragments. Using source code files also limit the generalisability of the ML models across projects. To address such challenges, this paper targets at the function-level vulnerability discovery in the cross-project scenario. A function representation learning method is proposed to obtain the high-level and generalizable function representations from the abstract syntax tree (AST). First, the serialized ASTs are used to learn project independence features. Then, a customized bi-directional LSTM neural network is devised to learn the sequential AST representations from the large number of raw features. The new function-level representation demonstrated promising performance gain, using a unique dataset where we manually labeled 6000+ functions from three open-source projects. The results confirm that the huge potential of the new AST-based function representation learning.\",\"PeriodicalId\":191367,\"journal\":{\"name\":\"Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-10-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"84\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3133956.3138840\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3133956.3138840","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects
In cybersecurity, vulnerability discovery in source code is a fundamental problem. To automate vulnerability discovery, Machine learning (ML) based techniques has attracted tremendous attention. However, existing ML-based techniques focus on the component or file level detection, and thus considerable human effort is still required to pinpoint the vulnerable code fragments. Using source code files also limit the generalisability of the ML models across projects. To address such challenges, this paper targets at the function-level vulnerability discovery in the cross-project scenario. A function representation learning method is proposed to obtain the high-level and generalizable function representations from the abstract syntax tree (AST). First, the serialized ASTs are used to learn project independence features. Then, a customized bi-directional LSTM neural network is devised to learn the sequential AST representations from the large number of raw features. The new function-level representation demonstrated promising performance gain, using a unique dataset where we manually labeled 6000+ functions from three open-source projects. The results confirm that the huge potential of the new AST-based function representation learning.