{"title":"Exploring the Long Tail of (Malicious) Software Downloads","authors":"Babak Rahbarinia, Marco Balduzzi, R. Perdisci","doi":"10.1109/DSN.2017.19","DOIUrl":null,"url":null,"abstract":"In this paper, we present a large-scale study of global trends in software download events, with an analysis of both benign and malicious downloads, and a categorization of events for which no ground truth is currently available. Our measurement study is based on a unique, real-world dataset collected at Trend Micro containing more than 3 million in-the-wild web-based software download events involving hundreds of thousands of Internet machines, collected over a period of seven months. Somewhat surprisingly, we found that despite our best efforts and the use of multiple sources of ground truth, more than 83% of all downloaded software files remain unknown, i.e. cannot be classified as benign or malicious, even two years after they were first observed. If we consider the number of machines that have downloaded at least one unknown file, we find that more than 69% of the entire machine/user population downloaded one or more unknown software file. Because the accuracy of malware detection systems reported in the academic literature is typically assessed only over software files that can be labeled, our findings raise concerns on their actual effectiveness in large-scale real-world deployments, and on their ability to defend the majority of Internet machines from infection. To better understand what these unknown software files may be, we perform a detailed analysis of their properties. We then explore whether it is possible to extend the labeling of software downloads by building a rule-based system that automatically learns from the available ground truth and can be used to identify many more benign and malicious files with very high confidence. This allows us to greatly expand the number of software files that can be labeled with high confidence, thus providing results that can benefit the evaluation of future malware detection systems.","PeriodicalId":426928,"journal":{"name":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2017.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13
Abstract
In this paper, we present a large-scale study of global trends in software download events, with an analysis of both benign and malicious downloads, and a categorization of events for which no ground truth is currently available. Our measurement study is based on a unique, real-world dataset collected at Trend Micro containing more than 3 million in-the-wild web-based software download events involving hundreds of thousands of Internet machines, collected over a period of seven months. Somewhat surprisingly, we found that despite our best efforts and the use of multiple sources of ground truth, more than 83% of all downloaded software files remain unknown, i.e. cannot be classified as benign or malicious, even two years after they were first observed. If we consider the number of machines that have downloaded at least one unknown file, we find that more than 69% of the entire machine/user population downloaded one or more unknown software file. Because the accuracy of malware detection systems reported in the academic literature is typically assessed only over software files that can be labeled, our findings raise concerns on their actual effectiveness in large-scale real-world deployments, and on their ability to defend the majority of Internet machines from infection. To better understand what these unknown software files may be, we perform a detailed analysis of their properties. We then explore whether it is possible to extend the labeling of software downloads by building a rule-based system that automatically learns from the available ground truth and can be used to identify many more benign and malicious files with very high confidence. This allows us to greatly expand the number of software files that can be labeled with high confidence, thus providing results that can benefit the evaluation of future malware detection systems.