Unsupervised Clustering for a Comparative Methodology of Machine Learning Models to Detect Domain-Generated Algorithms Based on an Alphanumeric Features Analysis

IF 3.9, Region 3 (Computer Science), JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Journal of Network and Systems Management Pub Date : 2024-01-02 DOI:10.1007/s10922-023-09793-6

Mohamed Hassaoui, Mohamed Hanini, Said El Kafhali


Citations: 0

Abstract

Domain Generation Algorithms (DGAs) are often used to generate large numbers of domain names that maintain command and control (C2) between an infected computer and the bot master. By creating many domain names as needed, attackers can mask their C2 servers and escape detection. Many malware families have switched to a stealthier contact approach, so traditional detection methods have become ineffective. Over the past decades, many researchers have started to use artificial intelligence to build systems able to detect DGAs in traffic, but these works do not evaluate their models on the same data. This article proposes a methodology, based on unsupervised clustering, for comparing machine learning models, and then applies it to study the best neural network and traditional machine learning models for detecting DGAs. We extracted 21 linguistic features based on alphanumeric and n-gram analysis and studied the correlation between these features in order to reduce their number. We examine these machine learning algorithms in detail and discuss the strengths and drawbacks of each method on specific classes of DGA, in order to propose a new switch-case model that could be consistently reliable at detecting DGAs.
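The feature extraction and correlation-based reduction described in the abstract can be illustrated with a minimal sketch. The abstract does not list the 21 features or the reduction procedure, so everything below is an assumption made for illustration only: the five features, the function names `alnum_features` and `drop_correlated`, the greedy pruning strategy, and the 0.95 threshold are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def alnum_features(domain: str) -> dict:
    """Illustrative alphanumeric and bigram features (not the paper's 21)."""
    # Simplification: analyse only the leftmost label, lowercased.
    label = domain.lower().split(".")[0]
    n = len(label)
    digits = sum(ch.isdigit() for ch in label)
    vowels = sum(ch in "aeiou" for ch in label)
    bigrams = [label[i:i + 2] for i in range(n - 1)]
    return {
        "length": n,
        "digit_ratio": digits / n if n else 0.0,
        "vowel_ratio": vowels / n if n else 0.0,
        "entropy": shannon_entropy(label),
        "distinct_bigram_ratio": len(set(bigrams)) / len(bigrams) if bigrams else 0.0,
    }


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def drop_correlated(rows, threshold=0.95):
    """Greedy pruning (assumed strategy): keep a feature only if its
    absolute correlation with every already-kept feature is <= threshold."""
    names = list(rows[0].keys())
    cols = {name: [row[name] for row in rows] for name in names}
    kept = []
    for name in names:
        if all(abs(pearson(cols[name], cols[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

A possible use is to compute `alnum_features` for each domain in a labelled sample, pass the resulting list of dictionaries to `drop_correlated`, and train the models under comparison only on the retained feature names. The greedy order means the result depends on how the features are listed, which is one reason this should be read as a sketch and not as the authors' procedure.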




Source journal

Journal of Network and Systems Management (Engineering & Technology: Telecommunications)

CiteScore: 7.60

Self-citation rate: 16.70%

Review time: >12 weeks

Journal description: The Journal of Network and Systems Management features peer-reviewed original research, as well as case studies, in the fields of network and system management. The journal regularly disseminates significant new information on both the telecommunications and computing aspects of these fields, as well as their evolution and emerging integration. This outstanding quarterly covers architecture, analysis, design, software, standards, and migration issues related to the operation, management, and control of distributed systems and communication networks for voice, data, video, and networked computing.