Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels

Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security Pub Date : 2015-10-16 DOI:10.1145/2808769.2808780

Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Brad Miller, Vaishaal Shankar, Rekha Bachwani, A. Joseph, J. D. Tygar

{"title":"Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels","authors":"Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Brad Miller, Vaishaal Shankar, Rekha Bachwani, A. Joseph, J. D. Tygar","doi":"10.1145/2808769.2808780","DOIUrl":null,"url":null,"abstract":"We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization for this fully unsupervised technique. We evaluate our method using 279,327 distinct binaries from VirusTotal, each of which appeared for the first time between January 2012 and June 2014. Our evaluation shows that our statistical model is consistently more accurate at predicting the future-derived ground truth than all unweighted rules of the form \"k out of n\" AV detections. In addition, we evaluate the scenario where partial ground truth is available for model building. We train a logistic regression predictor on the partial label information. Our results show that as few as a 100 randomly selected training instances with ground truth are enough to achieve 80% true positive rate for 0.1% false positive rate. In comparison, the best unweighted threshold rule provides only 60% true positive rate at the same false positive rate.","PeriodicalId":426614,"journal":{"name":"Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"96","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2808769.2808780","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 96

Abstract

We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization for this fully unsupervised technique. We evaluate our method using 279,327 distinct binaries from VirusTotal, each of which appeared for the first time between January 2012 and June 2014. Our evaluation shows that our statistical model is consistently more accurate at predicting the future-derived ground truth than all unweighted rules of the form "k out of n" AV detections. In addition, we evaluate the scenario where partial ground truth is available for model building. We train a logistic regression predictor on the partial label information. Our results show that as few as a 100 randomly selected training instances with ground truth are enough to achieve 80% true positive rate for 0.1% false positive rate. In comparison, the best unweighted threshold rule provides only 60% true positive rate at the same false positive rate.

查看原文本刊更多论文

更好的恶意软件真相:反病毒厂商标签加权技术

我们研究了将多个反病毒(AV)供应商的检测器的结果聚合为每个二进制文件的单个权威真值标签的问题。为此，我们采用了一个著名的生成贝叶斯模型，该模型假设AV标签所依赖的隐藏基础真理的存在。对于这种完全无监督的技术，我们使用基于期望最大化的训练。我们使用VirusTotal的279,327个不同的二进制文件来评估我们的方法，每个二进制文件都是在2012年1月至2014年6月之间首次出现的。我们的评估表明，我们的统计模型在预测未来衍生的基础真理方面始终比“k out of n”AV检测形式的所有未加权规则更准确。此外，我们评估了部分地真值可用于模型构建的场景。我们在部分标签信息上训练逻辑回归预测器。我们的结果表明，只需100个随机选择的训练实例，就足以达到80%的真阳性率和0.1%的假阳性率。相比之下，在假阳性率相同的情况下，最佳的非加权阈值规则只能提供60%的真阳性率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security

自引率

0.00%

发文量