NLabel: An Accurate Familial Clustering Framework for Large-scale Weakly-labeled Malware

Yannan Liu, Yabin Lai, Kaizhi Wei, Liang Gu, Zhengzheng Yan
{"title":"NLabel: An Accurate Familial Clustering Framework for Large-scale Weakly-labeled Malware","authors":"Yannan Liu, Yabin Lai, Kaizhi Wei, Liang Gu, Zhengzheng Yan","doi":"10.1109/TrustCom50675.2020.00039","DOIUrl":null,"url":null,"abstract":"Automatic family labeling for malware is in demand, especially for today's malware scale. While business Anti-Virus engines provide an efficient family labeling method, the raw labels tend to be inconsistent. Prior works mitigate such inconsistency by detecting the aliases and majority voting to obtain the final family label. However, these methods solve the inconsistency in a coarse-grained and vulnerable manner, and the obtained family label is inaccurate sometimes. In this work, we propose NLabel to conduct familial clustering based on AV engines' raw labels. On the one hand, NLabel uses word embedding techniques to capture the similarity among raw labels, transform the inconsistent labels of the same family into similar semantic representations, and mitigate the inconsistency at finer granularity. On the other hand, we propose a hierarchical family clustering method to boost the performance of large-scale data sets. Experimental results show that our method outperforms the SOTA.","PeriodicalId":221956,"journal":{"name":"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TrustCom50675.2020.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Automatic family labeling for malware is in demand, especially for today's malware scale. While business Anti-Virus engines provide an efficient family labeling method, the raw labels tend to be inconsistent. Prior works mitigate such inconsistency by detecting the aliases and majority voting to obtain the final family label. However, these methods solve the inconsistency in a coarse-grained and vulnerable manner, and the obtained family label is inaccurate sometimes. In this work, we propose NLabel to conduct familial clustering based on AV engines' raw labels. On the one hand, NLabel uses word embedding techniques to capture the similarity among raw labels, transform the inconsistent labels of the same family into similar semantic representations, and mitigate the inconsistency at finer granularity. On the other hand, we propose a hierarchical family clustering method to boost the performance of large-scale data sets. Experimental results show that our method outperforms the SOTA.
NLabel:大规模弱标记恶意软件的精确家族聚类框架
对恶意软件的自动家族标记是有需求的,特别是对于今天的恶意软件规模。虽然商业反病毒引擎提供了一种有效的家族标签方法,但原始标签往往不一致。先前的工作通过检测别名和多数投票来获得最终的家族标签来缓解这种不一致。然而,这些方法解决不一致性的方法都是粗粒度的、易受攻击的,而且有时得到的家族标签也不准确。在这项工作中,我们提出了NLabel基于AV引擎的原始标签进行家族聚类。一方面,NLabel利用词嵌入技术捕获原始标签之间的相似性,将同族不一致的标签转化为相似的语义表示,并在更细的粒度上缓解不一致。另一方面,我们提出了一种层次族聚类方法来提高大规模数据集的性能。实验结果表明,该方法优于SOTA算法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信