A multi-dimensional DNS domain intelligence dataset for cybersecurity research

IF 1.4 Q3 MULTIDISCIPLINARY SCIENCES

Data in Brief. Pub Date: 2025-09-13. DOI: 10.1016/j.dib.2025.112062

Radek Hranický, Ondřej Ondryáš, Adam Horák, Petr Pouč, Kamil Jeřábek, Tomáš Ebert, Jan Polišenský

Data in Brief, Volume 62, Article 112062. Journal Article. https://www.sciencedirect.com/science/article/pii/S235234092500784X

Citations: 0

Abstract

The escalating sophistication and frequency of cyber threats require advanced solutions in cybersecurity research. Particularly, phishing and malware detection have become increasingly reliant on data-driven approaches. This paper presents a unique dataset precisely curated to bolster research in network security, focusing on the classification and analysis of internet domains. This dataset contains information for over a million internet domains with detailed labels distinguishing between phishing, malware, and benign traffic.

Our dataset is distinctive due to its comprehensive compilation of metainformation derived from multiple sources, including DNS records, TLS handshakes and certificates, WHOIS and RDAP services, IP-related data, and geolocation details. Such rich, multi-dimensional data allows for a deeper analysis and understanding of domain characteristics that are critical in identifying and categorizing cyber threats. The integration of information from diverse sources enhances the dataset's utility, providing a holistic view of each domain's footprint and its potential security implications.

The data is formatted in JSON, which makes it accessible to researchers and easy to integrate into analytical tools and platforms for statistical analysis, machine learning, and other computational work. The dataset's volume and variety surpass any known publicly available resource in this field, making it a valuable asset for both academic research and the practical development and testing of cybersecurity solutions.

This paper thoroughly describes the value of the data, details the comprehensive methodology employed in the collection process, and provides a clear description of the data structure. Such documentation is crucial for ensuring that the dataset can be effectively utilized and reapplied in a variety of research contexts. Its structured format and the broad range of included features are critical for developing robust cybersecurity solutions and can be adapted for emerging threats.
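As an illustration of how JSON records of this kind might be prepared for statistical analysis or machine learning, the following Python sketch counts labels and flattens nested per-source data into flat feature rows. The abstract does not give the schema, so every field name here (`domain_name`, `label`, `dns`, `tls`, `geo`) and the sample records themselves are hypothetical; only the three label values (phishing, malware, benign) come from the abstract. The data article's own description of the data structure should be consulted for the real layout.

```python
import json
from collections import Counter

# Hypothetical sample records. The field names and values are illustrative
# assumptions, NOT the dataset's documented schema.
SAMPLE = """
[
  {"domain_name": "benign.example", "label": "benign",
   "dns": {"A": ["192.0.2.10", "192.0.2.11"]},
   "tls": {"protocol": "TLSv1.3"}, "geo": {"country": "CZ"}},
  {"domain_name": "phish.example", "label": "phishing",
   "dns": {"A": ["198.51.100.7"]},
   "tls": null, "geo": {"country": "US"}},
  {"domain_name": "malware.example", "label": "malware",
   "dns": {"A": []},
   "tls": {"protocol": "TLSv1.2"}, "geo": {}}
]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; lists are reduced to their length."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # A simple numeric feature: how many entries the list holds.
            flat[f"{name}.count"] = len(value)
        else:
            # Scalars (including None for missing data) are kept as-is.
            flat[name] = value
    return flat


records = json.loads(SAMPLE)
label_counts = Counter(record["label"] for record in records)
rows = [flatten(record) for record in records]

print(label_counts)
print(rows[0])
```

Flat dotted keys are one possible design choice: they map directly onto the columns of a table or data frame, while missing sources (a `null` TLS entry, an empty geolocation object) stay visible as `None` or as absent keys rather than being silently filled in.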


Source journal

Data in Brief (MULTIDISCIPLINARY SCIENCES)

CiteScore: 3.10

Self-citation rate: 0.00%

Articles published: 996

Review time: 70 days

Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.

Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.