Enhanced DGA botnet domain detection and family classification via n-gram analysis and Hellinger distance
Jungmin Lee, Sungju Yun, Nahyun Kim, Yeonjoon Lee
Computer Networks, volume 269, Article 111415. Published 2025-06-10. DOI: 10.1016/j.comnet.2025.111415
https://www.sciencedirect.com/science/article/pii/S1389128625003822
Abstract
Bot masters spread malware to create botnets and use Domain Generation Algorithms (DGAs) to evade blacklist-based detection methods with numerous generated domains, posing a significant threat to network security. Since detection alone cannot halt malware operations, classifying DGA domains into their respective botnet families is essential for enabling targeted countermeasures and addressing vulnerabilities in infected systems. However, most existing approaches focus primarily on distinguishing DGA domains from legitimate ones and face challenges when classifying domains from DGA families with similar character distributions, highlighting the need for improved techniques. In response, we expand the focus to DGA family classification and conduct in-depth analyses using eXplainable Artificial Intelligence (XAI) techniques to explore the impact of n-grams on classification performance. These analyses reveal that n-gram preprocessing and Hellinger Distance (HD)-based features derived from n-gram probability distributions can significantly enhance classification performance. Building on these insights, we propose an integrated framework with two components, an N-gram-based Multi-scale One-Dimensional Convolutional Neural Network model (N-MODCNN) and a machine learning (ML) classifier utilizing HD features, for detecting and classifying DGA domains. N-MODCNN detects DGA domains from n-gram preprocessed inputs, and detected domains are classified into their respective botnet families by a soft ensemble approach that integrates predictions from N-MODCNN and the ML classifier, enabling robust and accurate classification. Experiments on recent public datasets show that our framework achieves up to 99% detection and classification accuracy. For families with similar character distributions, it achieves F1-scores exceeding 90%, representing improvements of up to 72 percentage points over existing methods.
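The two ideas the abstract names, Hellinger distance between n-gram probability distributions and a soft ensemble of two classifiers, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the abstract does not say which n values are used, which reference distributions the distance is measured against, or how the ensemble weights are chosen. The function names `ngram_distribution`, `hellinger_distance`, and `soft_ensemble` are hypothetical.

```python
import math
from collections import Counter


def ngram_distribution(domain: str, n: int = 2) -> dict[str, float]:
    """Relative frequency of each character n-gram in a domain string.

    Assumption: n-grams are taken over the raw string with a sliding window.
    """
    grams = [domain[i:i + n] for i in range(len(domain) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def hellinger_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Hellinger distance between two discrete distributions.

    H(P, Q) = sqrt( (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2 ), in [0, 1].
    N-grams missing from one distribution count as probability 0.
    """
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


def soft_ensemble(
    probs_a: list[float], probs_b: list[float], weight_a: float = 0.5
) -> list[float]:
    """Weighted average of two class-probability vectors (soft voting).

    Assumption: both vectors list the same classes in the same order;
    weight_a = 0.5 is an arbitrary placeholder.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("probability vectors must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


if __name__ == "__main__":
    # Hypothetical inputs: one candidate domain and one reference string.
    candidate = ngram_distribution("xkqzvbnwprt", n=2)
    reference = ngram_distribution("examplewebsite", n=2)
    print("Hellinger distance:", hellinger_distance(candidate, reference))

    # Hypothetical per-family probabilities from two models.
    combined = soft_ensemble([0.6, 0.4], [0.2, 0.8])
    print("Predicted family index:", combined.index(max(combined)))
```

In this sketch, identical distributions give a distance of 0 and distributions with no n-grams in common give 1. A practical feature vector would more plausibly compare each domain against distributions aggregated over many legitimate and per-family domains, but that construction is a guess beyond what the abstract states.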
About the journal:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.