HAC-T and Fast Search for Similarity in Security

Jonathan J. Oliver, Muqeet Ali, Josiah Hagen
Published in: 2020 International Conference on Omni-layer Intelligent Systems (COINS), 2020-08-01
DOI: https://doi.org/10.1109/COINS49042.2020.9191381
Citations: 7

Abstract

Similarity digests have gained popularity in many security applications, such as blacklisting/whitelisting and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware and is more resistant to evasion than other similarity digests such as ssdeep and sdhash. Searching and clustering are fundamental tools that help security analysts and security operations center (SOC) operators hunt and analyze malware. Current approaches to clustering malware do not scale well enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques for fast search and clustering of TLSH hash digests, which can help analysts inspect large amounts of malware and goodware. Our approach builds on fast nearest neighbor search techniques to construct a tree-based index that performs fast search over TLSH hash digests. The tree-based index is used in our threshold-based Hierarchical Agglomerative Clustering (HAC-T) algorithm, which clusters digests in a scalable manner: it runs in O(n log n) time on average. We performed an empirical evaluation comparing our approach with many standard and recent clustering techniques, and we demonstrate that it is much more scalable while still producing good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. Using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee), we obtained high purity scores in the range 0.97 to 0.98, which demonstrates the effectiveness of the proposed method.
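The abstract reports cluster quality as purity. As a minimal sketch, assuming the standard definition of purity (the fraction of samples that carry the majority label of their own cluster), the computation could look like the following. The function name `purity`, the toy cluster assignments, and the toy family labels are illustrative assumptions and are not taken from the paper; how the authors handle samples with missing or conflicting vendor labels is not stated in the abstract and is not modeled here.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, labels):
    """Standard cluster purity (assumed definition, not the paper's code).

    For each cluster, count the samples that carry that cluster's most
    common label, sum those counts, and divide by the number of samples.
    """
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Map each cluster id to a histogram of the labels it contains.
    label_counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        label_counts[cluster_id][label] += 1

    # Sum the size of the majority label in every cluster.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six samples in three clusters.
clusters = [0, 0, 0, 1, 1, 2]
families = ["emotet", "emotet", "zbot", "zbot", "zbot", "benign"]
score = purity(clusters, families)
print(f"purity = {score:.3f}")
```

Purity rewards clusters that are internally consistent but does not penalize splitting one family across many clusters, so a score near 1.0 is usually read together with the number of clusters produced.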