NLabel: An Accurate Familial Clustering Framework for Large-scale Weakly-labeled Malware

2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). Pub Date: 2020-12-01. DOI: 10.1109/TrustCom50675.2020.00039

Yannan Liu, Yabin Lai, Kaizhi Wei, Liang Gu, Zhengzheng Yan



Abstract



Automatic family labeling for malware is in demand, especially at today's malware scale. While commercial anti-virus (AV) engines provide an efficient way to label malware families, their raw labels tend to be inconsistent. Prior work mitigates this inconsistency by detecting aliases and applying majority voting to obtain the final family label. However, these methods resolve the inconsistency in a coarse-grained and vulnerable manner, and the resulting family label is sometimes inaccurate. In this work, we propose NLabel, which performs familial clustering based on AV engines' raw labels. On the one hand, NLabel uses word embedding techniques to capture the similarity among raw labels, transforming inconsistent labels of the same family into similar semantic representations and thereby mitigating the inconsistency at a finer granularity. On the other hand, we propose a hierarchical family clustering method to boost performance on large-scale data sets. Experimental results show that our method outperforms the state of the art.
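The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the general idea (represent raw AV label tokens as vectors, then group samples whose representations are similar). It is not NLabel's actual method. It makes several assumptions that are not from the paper: simple co-occurrence counts stand in for learned word embeddings, a threshold-based single-link grouping stands in for the hierarchical clustering, and the stop-list `GENERIC`, the helper names, the threshold value, and the sample labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stop-list of generic (non-family) tokens; not taken from the paper.
GENERIC = {"trojan", "win32", "generic", "malware", "ransom", "variant"}


def tokenize(label):
    """Split a raw AV label into lowercase candidate family tokens."""
    parts = re.split(r"[^a-z0-9]+", label.lower())
    return [p for p in parts
            if len(p) >= 4 and not p.isdigit() and p not in GENERIC]


def token_vectors(samples):
    """Map each token to counts of the tokens it co-occurs with on a sample.

    Assumption: co-occurrence counts are a crude stand-in for word embeddings.
    """
    vecs = defaultdict(Counter)
    for labels in samples.values():
        tokens = {t for label in labels for t in tokenize(label)}
        for a in tokens:
            for b in tokens:
                vecs[a][b] += 1
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def cluster_samples(samples, threshold=0.8):
    """Group samples whose summed token vectors have cosine >= threshold."""
    vecs = token_vectors(samples)
    sample_vecs = {}
    for sid, labels in samples.items():
        total = Counter()
        for t in {t for label in labels for t in tokenize(label)}:
            total.update(vecs[t])
        sample_vecs[sid] = total

    # Union-find over sample ids (single-link grouping, O(n^2) comparisons).
    parent = {sid: sid for sid in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(samples, 2):
        if cosine(sample_vecs[a], sample_vecs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for sid in samples:
        groups[find(sid)].append(sid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical raw labels from several AV engines for four samples.
SAMPLES = {
    "s1": ["Trojan.Win32.Zbot.abc", "Zeus.Gen", "Trojan/Zbot"],
    "s2": ["Win32/Zbot.X", "Trojan.Zeus"],
    "s3": ["Ransom.WannaCry", "Trojan.Wcry.A", "WannaCrypt"],
    "s4": ["Ransom.Wcry", "WannaCry.B"],
}
clusters = cluster_samples(SAMPLES)
print(clusters)
```

In this toy data the aliases `zbot` and `zeus` always appear together, so their vectors coincide even though the strings differ, which is the kind of alias handling that exact-match majority voting would miss. The all-pairs comparison here does not scale to large data sets; that is the problem the paper's hierarchical clustering is stated to address.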
