If at first you don't succeed, trie, trie again: Correcting TLSH scalability claims for large-dataset malware forensics

IF 2.2 · CAS Zone 4 (Medicine) · Q3 Computer Science, Information Systems

Forensic Science International: Digital Investigation · Pub Date: 2025-07-01 · DOI: 10.1016/j.fsidi.2025.301922

Jordi Gonzalez


Abstract

Malware analysts use Trend Micro Locality-Sensitive Hashing (TLSH) for malware similarity computation, nearest-neighbor search, and related tasks such as clustering and family classification. Although TLSH scales better than many alternatives, technical limitations have restricted its application to larger datasets. Using the Lean 4 proof assistant, I formalized bounds on the properties of TLSH most relevant to its scalability and identified flaws in prior TLSH nearest-neighbor search algorithms. I used these formal results to design correct acceleration structures for TLSH nearest-neighbor queries. On typical analyst workloads, these structures ran one to two orders of magnitude faster than the prior state of the art, allowing analysts to work with datasets at least an order of magnitude larger than was previously feasible with the same computational resources. I make all code and data publicly available.
