Siamese Neural Network and Machine Learning for DGA Classification

European Conference on Cyber Warfare and Security. Pub Date: 2022-06-08. DOI: 10.34190/eccws.21.1.205

L. Segurola, Telmo Egüés, Francesco Zola, Raul Orduna


Citations: 0

Abstract


Domain Generation Algorithms (DGAs) are systems used to rapidly create many varying domain names. Such “artificial” domains can then be used to host command and control servers, which in turn oversee the recruitment and infection of devices, ultimately turning them into new resources to be exploited. Identifying DGA domain names can therefore be crucial for preventing cyberattacks such as phishing, spam sending, Bitcoin mining, and many others. Domain names generated by DGAs usually consist of illegible character strings, but new “intelligent” DGAs tend to generate names by combining dictionary words, which makes their detection a challenging task. For this reason, in this work we propose to address the problem using a combination of Machine Learning algorithms to improve the classification of DGA domains. In particular, we propose combining Siamese Neural Networks with traditional supervised Machine Learning algorithms in order to map the input domains to separable n-dimensional data points and then classify them. The proposed approach can be divided into three phases. In the first phase, domain names are encoded by a one-hot encoder and by a variation of it, named the probabilistic one-hot encoder; the two are implemented separately. In the second phase, Long Short-Term Memory (LSTM) and Convolutional Siamese embedders are tested and compared: the LSTM embedder is combined with the one-hot encoding, while the Convolutional embedder is applied to the probabilistic one-hot encoded data. In the final phase, five Machine Learning algorithms are tested on the data embedded in each of the two ways. Both embedder approaches reach very high F1-score and Accuracy values (about 91%), depending on the classifier used. These promising results show that it is possible to perform DGA domain classification using the domain names alone, without considering external information such as DNS packet features.
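To make the first two phases more concrete, the following is a minimal Python sketch of character-level one-hot encoding and of a pairwise loss of the kind often used to train Siamese embedders. It is an illustration only, not the authors' implementation: the alphabet, the fixed length `MAX_LEN`, the function names, and the choice of a contrastive loss with a margin are all assumptions, since the abstract does not specify them. The probabilistic one-hot encoder is not sketched because the abstract does not define it.

```python
import math
import string

# Assumed alphabet (not taken from the paper): lowercase letters, digits, hyphen, dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 16  # hypothetical fixed length; longer names are truncated


def one_hot_encode(domain, max_len=MAX_LEN):
    """Return a max_len x len(ALPHABET) matrix of 0/1 rows, zero-padded."""
    rows = []
    for ch in domain.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:  # characters outside the alphabet stay all-zero
            row[idx] = 1
        rows.append(row)
    while len(rows) < max_len:  # pad short names with all-zero rows
        rows.append([0] * len(ALPHABET))
    return rows


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Assumed training objective: pull same-class pairs together and push
    different-class pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


encoded = one_hot_encode("example.com")
print(len(encoded), len(encoded[0]))
```

In a full pipeline, the one-hot matrices would be fed to a shared-weight (Siamese) LSTM or convolutional network trained on pairs of domains, and the resulting embeddings would then be passed to a conventional supervised classifier, as the abstract describes.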
