Machine learning for tree structures in fake site detection

Proceedings of the 15th International Conference on Availability, Reliability and Security. Pub Date: 2020-08-25. DOI: 10.1145/3407023.3407035

Taichi Ishikawa, Yu-Lu Liu, D. Shepard, Kilho Shin


Citations: 2

Abstract

Tree data analysis has many applications in information security. In particular, HTML pages' DOM trees are an important target of analysis because web pages can be vectors for, and targets of, major cyberattacks like phishing. Previous attempts to incorporate tree data analysis into security applications, however, have been hampered by the lack of efficient methods for tree data analysis in machine learning. As such, most security research has focused on data representable as vectors of real numbers, like most machine learning work. Recent work, however, has yielded several efficiency breakthroughs in tree analysis. One example is kernel methods, a methodological bridge that fills the gap between discretely structured data (like trees) and multivariate analysis. Kernel methods enable applying a variety of multivariate analysis techniques such as SVM and PCA to trees. The method we are interested in is the subpath kernel. The subpath kernel offers the following advantages: (1) it is invariant over ordered and unordered trees; (2) it can be computed using an extremely fast linear-time algorithm, compared to the quadratic time required to compute values of most tree kernels; (3) its excellent prediction accuracy has been proven through intensive experiments. This paper proposes a subpath kernel-based method for tree-structured security data. To demonstrate the effectiveness of our method, we apply it to the problem of detecting fake e-commerce sites, a sub-problem of phishing detection with a significant real-world financial cost. In an experiment on a real dataset of fake sites provided by a major e-commerce company, our method exhibited accuracy as high as 0.998 when training an SVM with as few as 1,000 instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy score reaches 0.996.
While previous phishing detection methods relied on textual content, URL components, and blacklists, our approach is the first to leverage DOM trees, which makes it both more effective and more robust against adversarial attacks. Unlike URL or content changes, changing a page's DOM structure imposes large costs on criminals.
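
To make the idea of a subpath kernel concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common formulation in which a subpath is any downward sequence of node labels (a contiguous piece of a root-to-leaf path) and the kernel value is the sum, over subpaths shared by both trees, of the product of their occurrence counts. The `(label, children)` tree representation, the function names, and the `decay` weight are assumptions introduced for illustration. This naive version enumerates every subpath explicitly, so it does not achieve the linear-time bound mentioned in the abstract.

```python
from collections import Counter

# A tree is a (label, [children]) pair. This representation is an assumption
# made for this sketch; a DOM tree could be mapped to it by using tag names
# as labels.


def subpath_counts(tree):
    """Count every downward label sequence (subpath) that occurs in the tree."""
    counts = Counter()

    def visit(node):
        label, children = node
        # Subpaths starting at this node: the node alone, plus the node's
        # label prepended to every subpath starting at one of its children.
        starting_here = Counter({(label,): 1})
        for child in children:
            for path, n in visit(child).items():
                starting_here[(label,) + path] += n
        counts.update(starting_here)
        return starting_here

    visit(tree)
    return counts


def subpath_kernel(t1, t2, decay=1.0):
    """Sum count products over shared subpaths, weighted by decay ** length."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[p] * decay ** len(p) for p, n in c1.items() if p in c2)


# Two toy "DOM" trees that differ only in the number of div children.
page_a = ("html", [("body", [("div", []), ("div", [])])])
page_b = ("html", [("body", [("div", [])])])
print(subpath_kernel(page_a, page_b))
```

Because only label sequences along vertical paths are counted, reordering siblings leaves the counts unchanged, which matches the ordered/unordered invariance claimed in advantage (1). A matrix of such kernel values over a training set could then be handed to any SVM implementation that accepts a precomputed kernel.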


