Vulnerability Detection in PHP Web Application Using Lexical Analysis Approach with Machine Learning
Authors: Dhika Rizki Anbiya, A. Purwarianti, Y. Asnar
Published in: 2018 5th International Conference on Data and Software Engineering (ICoDSE)
Publication date: 2018-11-01
DOI: 10.1109/ICODSE.2018.8705809 (https://doi.org/10.1109/ICODSE.2018.8705809)
Security is an important and persistently challenging concern, especially in web applications. Today, 78.9% of websites use PHP as their programming language. Because PHP is so popular, web applications written in it tend to contain many vulnerabilities, and these are reflected in their source code. Static analysis can be used to detect vulnerabilities in source code, but it usually requires an additional method that depends on expert knowledge. In this paper, we propose a vulnerability detection technique that uses lexical analysis with machine learning as the classification method. We focus on using PHP native tokens and the Abstract Syntax Tree (AST) as features, and we manipulate them to obtain the best feature set. We prune the AST to discard unusable nodes or subtrees and then extract the node-type tokens with the Breadth-First Search (BFS) algorithm. In addition, unusable PHP tokens are filtered out and the remaining tokens are combined with one another to enrich the features, which are extracted using TF-IDF. These features are then used for machine learning classification to determine whether AST tokens or PHP tokens give the better features. The classification methods we used were Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and Decision Tree. As a result, we obtained the highest recall score, 92%, with PHP tokens as features and Gaussian Naïve Bayes as the classification method.
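The feature-extraction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `PRUNED_KINDS` and `DROPPED_TOKENS` sets, the helper names, the bigram scheme used to "combine" tokens, and the textbook TF-IDF formula are all hypothetical choices, since the abstract does not specify them. The node-type and token names are only illustrative labels.

```python
from __future__ import annotations

import math
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a node-type name plus its children."""
    kind: str
    children: list[Node] = field(default_factory=list)


# Assumed lists of "unusable" items; the abstract does not enumerate them.
PRUNED_KINDS = {"Stmt_Nop", "Stmt_InlineHTML"}
DROPPED_TOKENS = {"T_OPEN_TAG", "T_WHITESPACE", "T_COMMENT"}


def bfs_node_types(root: Node, pruned: set[str] = PRUNED_KINDS) -> list[str]:
    """Return node-type tokens in breadth-first order, skipping pruned subtrees."""
    if root.kind in pruned:
        return []
    tokens: list[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        tokens.append(node.kind)
        # A pruned child is never enqueued, so its whole subtree is dropped.
        queue.extend(c for c in node.children if c.kind not in pruned)
    return tokens


def filter_tokens(tokens: list[str], dropped: set[str] = DROPPED_TOKENS) -> list[str]:
    """Remove token names considered unusable as features."""
    return [t for t in tokens if t not in dropped]


def with_bigrams(tokens: list[str]) -> list[str]:
    """Enrich a token list by appending adjacent-pair (bigram) tokens."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Textbook TF-IDF: tf = count / len(doc), idf = ln(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return weights


# Toy AST: one statement with a no-op child that gets pruned.
ast = Node("Stmt_Echo", [
    Node("Stmt_Nop"),
    Node("Expr_BinaryOp_Concat", [
        Node("Scalar_String"),
        Node("Expr_ArrayDimFetch", [Node("Expr_Variable"), Node("Scalar_String")]),
    ]),
])
ast_tokens = bfs_node_types(ast)

# Toy PHP token streams for two files, filtered and enriched with bigrams.
php_docs = [
    filter_tokens(["T_OPEN_TAG", "T_ECHO", "T_WHITESPACE", "T_VARIABLE"]),
    filter_tokens(["T_OPEN_TAG", "T_ECHO", "T_WHITESPACE", "T_CONSTANT_ENCAPSED_STRING"]),
]
weights = tf_idf([with_bigrams(doc) for doc in php_docs])
```

In this sketch, a token that appears in every file (here `T_ECHO`) receives a weight of zero, while tokens that distinguish one file from another receive positive weights. The resulting weight dictionaries would still need to be converted into fixed-length vectors before being passed to a classifier such as Gaussian Naïve Bayes, SVM, or a decision tree; that step is omitted here.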