基于本福德定律的基于距离的特征选择用于恶意软件检测

IF 5.4 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Computers & Security Pub Date : 2025-08-22 DOI:10.1016/j.cose.2025.104625

Pedro Fernandes , Séamus Ó Ciardhuáin , Mário Antunes

{"title":"基于本福德定律的基于距离的特征选择用于恶意软件检测","authors":"Pedro Fernandes , Séamus Ó Ciardhuáin , Mário Antunes","doi":"10.1016/j.cose.2025.104625","DOIUrl":null,"url":null,"abstract":"<div><div>Detecting malware in computer networks and data streams from Android devices remains a critical challenge for cybersecurity researchers. While machine learning and deep learning techniques have shown promising results, these approaches often require large volumes of labelled data, offer limited interpretability, and struggle to adapt to sophisticated threats such as zero-day attacks. Moreover, their high computational requirements restrict their applicability in resource-constrained environments.</div><div>This research proposes an innovative approach that advances the state of the art by offering practical solutions for dynamic and data-limited security scenarios. By integrating natural statistical laws, particularly Benford’s law, with dissimilarity functions, a lightweight, fast, and scalable model is developed that eliminates the need for extensive training and large labelled datasets while improving resilience to data imbalance and scalability for large-scale cybersecurity applications.</div><div>Although Benford’s law has demonstrated potential in anomaly detection, its effectiveness is limited by the difficulty of selecting relevant features. To overcome this, the study combines Benford’s law with several distance functions, including Median Absolute Deviation, Kullback–Leibler divergence, Euclidean distance, and Pearson correlation, enabling statistically grounded feature selection. Additional metrics, such as the Kolmogorov test, Jensen–Shannon divergence, and Z statistics, were used for model validation.</div><div>This approach quantifies discrepancies between expected and observed distributions, addressing classic feature selection challenges like redundancy and imbalance. Validated on both balanced and unbalanced datasets, the model achieved strong results: 88.30% accuracy and 85.08% F1-score in the balanced set, 92.75% accuracy and 95.29% F1-score in the unbalanced set. The integration of Benford’s law with distance functions significantly reduced false positives and negatives.</div><div>Compared to traditional Machine Learning methods, which typically require extensive training and large datasets to achieve F1 scores between 92% and 99%, the proposed approach delivers competitive performance while enhancing computational efficiency, robustness, and interpretability. This balance makes it a practical and scalable alternative for real-time or resource-constrained cybersecurity environments.</div></div>","PeriodicalId":51004,"journal":{"name":"Computers & Security","volume":"158 ","pages":"Article 104625"},"PeriodicalIF":5.4000,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distance-based feature selection using Benford’s law for malware detection\",\"authors\":\"Pedro Fernandes , Séamus Ó Ciardhuáin , Mário Antunes\",\"doi\":\"10.1016/j.cose.2025.104625\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Detecting malware in computer networks and data streams from Android devices remains a critical challenge for cybersecurity researchers. While machine learning and deep learning techniques have shown promising results, these approaches often require large volumes of labelled data, offer limited interpretability, and struggle to adapt to sophisticated threats such as zero-day attacks. Moreover, their high computational requirements restrict their applicability in resource-constrained environments.</div><div>This research proposes an innovative approach that advances the state of the art by offering practical solutions for dynamic and data-limited security scenarios. By integrating natural statistical laws, particularly Benford’s law, with dissimilarity functions, a lightweight, fast, and scalable model is developed that eliminates the need for extensive training and large labelled datasets while improving resilience to data imbalance and scalability for large-scale cybersecurity applications.</div><div>Although Benford’s law has demonstrated potential in anomaly detection, its effectiveness is limited by the difficulty of selecting relevant features. To overcome this, the study combines Benford’s law with several distance functions, including Median Absolute Deviation, Kullback–Leibler divergence, Euclidean distance, and Pearson correlation, enabling statistically grounded feature selection. Additional metrics, such as the Kolmogorov test, Jensen–Shannon divergence, and Z statistics, were used for model validation.</div><div>This approach quantifies discrepancies between expected and observed distributions, addressing classic feature selection challenges like redundancy and imbalance. Validated on both balanced and unbalanced datasets, the model achieved strong results: 88.30% accuracy and 85.08% F1-score in the balanced set, 92.75% accuracy and 95.29% F1-score in the unbalanced set. The integration of Benford’s law with distance functions significantly reduced false positives and negatives.</div><div>Compared to traditional Machine Learning methods, which typically require extensive training and large datasets to achieve F1 scores between 92% and 99%, the proposed approach delivers competitive performance while enhancing computational efficiency, robustness, and interpretability. This balance makes it a practical and scalable alternative for real-time or resource-constrained cybersecurity environments.</div></div>\",\"PeriodicalId\":51004,\"journal\":{\"name\":\"Computers & Security\",\"volume\":\"158 \",\"pages\":\"Article 104625\"},\"PeriodicalIF\":5.4000,\"publicationDate\":\"2025-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computers & Security\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167404825003141\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Security","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167404825003141","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

检测计算机网络中的恶意软件和来自安卓设备的数据流仍然是网络安全研究人员面临的一个关键挑战。虽然机器学习和深度学习技术已经显示出有希望的结果，但这些方法通常需要大量标记数据，提供有限的可解释性，并且难以适应零日攻击等复杂威胁。此外，它们的高计算要求限制了它们在资源受限环境中的适用性。这项研究提出了一种创新的方法，通过为动态和数据有限的安全场景提供实用的解决方案，提高了目前的技术水平。通过将自然统计定律（特别是本福德定律）与不相似函数相结合，开发了一种轻量级、快速和可扩展的模型，消除了对大量训练和大型标记数据集的需求，同时提高了对数据不平衡的弹性和大规模网络安全应用的可扩展性。尽管本福德定律在异常检测中显示出潜力，但其有效性受到相关特征选择困难的限制。为了克服这个问题，该研究将本福德定律与几个距离函数结合起来，包括中位数绝对偏差、Kullback-Leibler散度、欧几里得距离和Pearson相关性，从而实现基于统计的特征选择。额外的度量，如Kolmogorov检验、Jensen-Shannon散度和Z统计量，被用于模型验证。这种方法量化了预期分布和观察分布之间的差异，解决了冗余和不平衡等经典特征选择挑战。在平衡和非平衡数据集上验证，该模型取得了较好的结果：平衡集准确率为88.30%，F1-score为85.08%；非平衡集准确率为92.75%，F1-score为95.29%。本福德定律与距离函数的集成显著减少了假阳性和假阴性。传统的机器学习方法通常需要大量的训练和大型数据集才能达到92%到99%的F1分数，与之相比，该方法在提高计算效率、鲁棒性和可解释性的同时，提供了具有竞争力的性能。这种平衡使其成为实时或资源受限的网络安全环境的实用且可扩展的替代方案。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Distance-based feature selection using Benford’s law for malware detection

Detecting malware in computer networks and data streams from Android devices remains a critical challenge for cybersecurity researchers. While machine learning and deep learning techniques have shown promising results, these approaches often require large volumes of labelled data, offer limited interpretability, and struggle to adapt to sophisticated threats such as zero-day attacks. Moreover, their high computational requirements restrict their applicability in resource-constrained environments.

This research proposes an innovative approach that advances the state of the art by offering practical solutions for dynamic and data-limited security scenarios. By integrating natural statistical laws, particularly Benford’s law, with dissimilarity functions, a lightweight, fast, and scalable model is developed that eliminates the need for extensive training and large labelled datasets while improving resilience to data imbalance and scalability for large-scale cybersecurity applications.

Although Benford’s law has demonstrated potential in anomaly detection, its effectiveness is limited by the difficulty of selecting relevant features. To overcome this, the study combines Benford’s law with several distance functions, including Median Absolute Deviation, Kullback–Leibler divergence, Euclidean distance, and Pearson correlation, enabling statistically grounded feature selection. Additional metrics, such as the Kolmogorov test, Jensen–Shannon divergence, and Z statistics, were used for model validation.

This approach quantifies discrepancies between expected and observed distributions, addressing classic feature selection challenges like redundancy and imbalance. Validated on both balanced and unbalanced datasets, the model achieved strong results: 88.30% accuracy and 85.08% F1-score in the balanced set, 92.75% accuracy and 95.29% F1-score in the unbalanced set. The integration of Benford’s law with distance functions significantly reduced false positives and negatives.

Compared to traditional Machine Learning methods, which typically require extensive training and large datasets to achieve F1 scores between 92% and 99%, the proposed approach delivers competitive performance while enhancing computational efficiency, robustness, and interpretability. This balance makes it a practical and scalable alternative for real-time or resource-constrained cybersecurity environments.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computers & Security 工程技术-计算机：信息系统

CiteScore

12.40

自引率

7.10%

发文量

365

审稿时长

10.7 months

期刊介绍： Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors - industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise it is your first step to fully secure systems.