Enhanced DGA botnet domain detection and family classification via n-gram analysis and Hellinger distance

IF 4.6 | Zone 2 (Computer Science) | Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Computer Networks | Pub Date: 2025-06-10 | DOI: 10.1016/j.comnet.2025.111415

Jungmin Lee, Sungju Yun, Nahyun Kim, Yeonjoon Lee

Computer Networks, Volume 269, Article 111415. Journal Article. Not open access. https://www.sciencedirect.com/science/article/pii/S1389128625003822

Citations: 0

Abstract

Bot masters spread malware to create botnets and use Domain Generation Algorithms (DGAs) to evade blacklist-based detection methods with numerous generated domains, posing a significant threat to network security. Since detection alone cannot halt malware operations, classifying DGA domains into their respective botnet families is essential for enabling targeted countermeasures and addressing vulnerabilities in infected systems. However, most existing approaches focus primarily on distinguishing DGA domains from legitimate ones and face challenges when classifying domains from DGA families with similar character distributions, highlighting the need for improved techniques. In response, we expand the focus to DGA family classification and conduct in-depth analyses using eXplainable Artificial Intelligence (XAI) techniques to explore the impact of n-grams on classification performance. These analyses reveal that n-gram preprocessing and Hellinger Distance (HD)-based features derived from n-gram probability distributions can significantly enhance classification performance. Building on these insights, we propose an integrated framework with two components, an N-gram-based Multi-scale One-Dimensional Convolutional Neural Network model (N-MODCNN) and a machine learning (ML) classifier utilizing HD features, for detecting and classifying DGA domains. N-MODCNN detects DGA domains from n-gram preprocessed inputs, and detected domains are classified into their respective botnet families by a soft ensemble approach that integrates predictions from N-MODCNN and the ML classifier, enabling robust and accurate classification. Experiments on recent public datasets show that our framework achieves up to 99% detection and classification accuracy. For families with similar character distributions, it achieves F1-scores exceeding 90%, representing improvements of up to 72 percentage points over existing methods.
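
The abstract names three ingredients: character n-grams, Hellinger Distance (HD) features computed from n-gram probability distributions, and a soft ensemble of two classifiers' predictions. The sketch below illustrates each in minimal form. It is not the authors' implementation: the abstract does not state the n-gram size, which reference distributions are used, or how the ensemble is weighted, so `n=2`, the `references` list, the equal `weight`, and all function names are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_distribution(domains, n=2):
    """Relative frequency of each character n-gram over a list of domain labels."""
    counts = Counter()
    for d in domains:
        counts.update(d[i:i + n] for i in range(len(d) - n + 1))
    total = sum(counts.values())
    # Labels shorter than n contribute no n-grams; an empty dict is returned
    # if nothing was counted.
    return {g: c / total for g, c in counts.items()} if total else {}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts.

    H(P, Q) = sqrt( sum_k (sqrt(p_k) - sqrt(q_k))^2 / 2 ), which lies in [0, 1].
    """
    keys = set(p) | set(q)
    s = sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return sqrt(s / 2)


def hd_features(domain, references, n=2):
    """One HD value per reference distribution (assumed: e.g. one per family)."""
    p = ngram_distribution([domain], n)
    return [hellinger(p, ref) for ref in references]


def soft_ensemble(probs_a, probs_b, weight=0.5):
    """Weighted average of two class-probability vectors.

    Returns (index of the highest averaged probability, averaged vector).
    The equal default weight is an assumption, not taken from the paper.
    """
    avg = [weight * a + (1 - weight) * b for a, b in zip(probs_a, probs_b)]
    return max(range(len(avg)), key=avg.__getitem__), avg
```

In use, one would build a reference distribution per class from training domains with `ngram_distribution`, turn each candidate domain into a feature vector with `hd_features`, feed that vector to a conventional ML classifier, and combine its class probabilities with the neural model's via `soft_ensemble`. A domain whose n-grams closely match a family's reference distribution yields a distance near 0 for that family and nearer 1 for families with little n-gram overlap.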

Source journal

Computer Networks (Engineering & Technology: Telecommunications)

CiteScore: 10.80

Self-citation rate: 3.60%

Articles published: 434

Review time: 8.6 months

Journal description: Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.