Machine learning for tree structures in fake site detection
Taichi Ishikawa, Yu-Lu Liu, D. Shepard, Kilho Shin
Proceedings of the 15th International Conference on Availability, Reliability and Security, 2020-08-25
DOI: 10.1145/3407023.3407035
Citations: 2
Abstract
Tree data analysis has many applications in information security. In particular, the DOM trees of HTML pages are an important target of analysis because web pages can be both vectors for and targets of major cyberattacks such as phishing. Previous attempts to incorporate tree data analysis into security applications, however, have been hampered by the lack of efficient machine learning methods for tree data. As a result, most security research, like most machine learning work, has focused on data representable as vectors of real numbers. Recent work, however, has yielded several efficiency breakthroughs in tree analysis. One example is kernel methods, which bridge the gap between discretely structured data (such as trees) and multivariate analysis: they make it possible to apply multivariate techniques such as SVM and PCA to trees. The method we focus on is the subpath kernel, which offers the following advantages: (1) it gives the same value whether trees are treated as ordered or unordered; (2) it can be computed with a fast linear-time algorithm, whereas most tree kernels require quadratic time; (3) its prediction accuracy has been demonstrated in extensive experiments. This paper proposes a subpath kernel-based method for tree-structured security data. To demonstrate its effectiveness, we apply it to the detection of fake e-commerce sites, a sub-problem of phishing detection with a significant real-world financial cost. In an experiment on a real dataset of fake sites provided by a major e-commerce company, our method reached an accuracy of 0.998 when training an SVM with as few as 1,000 instances. It also generalizes well from small samples: with only 100 training instances, accuracy reaches 0.996. While previous phishing detection methods relied on textual content, URL components, and blacklists, our approach is the first to leverage DOM trees, which makes it both more effective and more robust against adversarial attacks. Unlike changing a URL or page content, changing a page's DOM structure imposes large costs on criminals.
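To make the idea of a subpath kernel on DOM trees concrete, here is a minimal sketch in Python. It assumes one common unweighted formulation: the kernel value is the sum, over every downward label path shared by two trees, of the product of that path's occurrence counts. This is a naive enumeration for illustration, not the linear-time algorithm the abstract refers to, and the names `Node`, `TagTreeBuilder`, `parse_tag_tree`, `subpath_counts`, `subpath_kernel`, and `normalized_kernel` are all hypothetical. Trees are built from tag names only, using the standard library's `html.parser`.

```python
from collections import Counter
from html.parser import HTMLParser

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A labeled tree node; the label is the HTML tag name."""
    def __init__(self, label):
        self.label = label
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of tag names, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # synthetic root (an assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def parse_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def subpath_counts(root):
    """Count each downward label path from any node to any descendant."""
    counts = Counter()

    def visit(node, ancestors):
        path = ancestors + (node.label,)
        # Every suffix of the root-to-node path is a subpath ending here.
        for start in range(len(path)):
            counts[path[start:]] += 1
        for child in node.children:
            visit(child, path)

    visit(root, ())
    return counts


def subpath_kernel(t1, t2):
    """Naive, unweighted subpath kernel: sum of count products."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[p] for p, n in c1.items() if p in c2)


def normalized_kernel(t1, t2):
    """Cosine-style normalization so that k(t, t) == 1."""
    return subpath_kernel(t1, t2) / (
        subpath_kernel(t1, t1) * subpath_kernel(t2, t2)) ** 0.5


if __name__ == "__main__":
    a = parse_tag_tree("<div><p></p><span></span></div>")
    b = parse_tag_tree("<div><span></span><p></p></div>")
    c = parse_tag_tree("<table><tr><td></td></tr></table>")
    print(normalized_kernel(a, b), normalized_kernel(a, c))
```

Because only vertical paths are counted, swapping sibling order (trees `a` and `b` above) leaves the kernel value unchanged, which illustrates advantage (1). A matrix of such values could be passed to any SVM implementation that accepts a precomputed kernel. The recursive traversal is adequate for a sketch, but very deep DOM trees would need an iterative version.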