Mecha: A Neural-Symbolic Open-Set Homogeneous Decision Fusion Approach for Zero-Day Malware Similarity Detection
Christopher Molloy, Jeremy Banks, Steven H. H. Ding, Furkan Alaca, Philippe Charland, Andrew Walenstein
IEEE Transactions on Software Engineering, vol. 51, no. 2, pp. 621-637. Published 2025-01-20. DOI: 10.1109/TSE.2025.3531210
https://ieeexplore.ieee.org/document/10847580/
Citation count: 0
Abstract
As the number of novel malware samples grows each year, tools are needed that match variants of the same family efficiently and accurately, in support of proactive threat detection, retro-hunting, and attack campaign tracking. All state-of-the-art Deep Learning (DL) approaches assume that incoming samples originate from known families, and they therefore misidentify samples from novel families. Additionally, most existing solutions that leverage the Siamese Neural Network architecture rely either on pair-wise comparisons or on computationally expensive preprocessing steps, neither of which scales to real-world malware triage volumes. We propose a different route: Mecha, a Neural-Symbolic Machine Learning (ML) system for malware variant matching and zero-day family detection. Mecha comprises an embedding network, trained in two different scenarios for byte string embedding, and an open-set approximate nearest neighbour algorithm for variant matching and zero-day detection. The embedding network uses triplet loss for embedding generation and reinforcement-based Expectation Maximization (EM) learning for full deployment optimization. We conduct multiple in-sample and out-of-sample experiments to demonstrate the model's generalizability toward novel variants and families. We also show that Mecha can detect samples outside the known set of malware samples with an accuracy greater than 0.990.
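The two ideas named in the abstract, triplet loss and open-set nearest-neighbour matching, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the Euclidean distance, the margin value, and the fixed distance threshold are all assumptions made for illustration, and an exact brute-force search stands in for the approximate nearest neighbour algorithm the paper describes.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Standard triplet loss: zero once the same-family sample is closer
    to the anchor than the other-family sample by at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def open_set_match(query: Vector,
                   index: List[Tuple[Vector, str]],
                   threshold: float) -> Optional[str]:
    """Return the family of the nearest indexed embedding, or None when
    the nearest neighbour is farther than `threshold` (hypothetical
    rejection rule: the sample is treated as an unknown family)."""
    best_family: Optional[str] = None
    best_dist = math.inf
    for embedding, family in index:  # exact search, not approximate
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_family, best_dist = family, dist
    return best_family if best_dist <= threshold else None


# Toy index of known-family embeddings (made-up values).
known = [((0.0, 0.0), "family_a"), ((5.0, 5.0), "family_b")]
near_known = open_set_match((0.1, 0.0), known, threshold=1.0)
far_from_all = open_set_match((20.0, 20.0), known, threshold=1.0)
```

In this toy setup, `near_known` resolves to a known family while `far_from_all` is rejected as unknown. A deployed system would replace the linear scan with an approximate nearest neighbour index and would need to choose the threshold empirically.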
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.