Performance Maintenance Over Time of Random Forest-based Malware Detection Models
Colin Galen, Robert Steele
2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), 2020-10-28
DOI: 10.1109/UEMCON51285.2020.9298068
Machine learning-based malware detection models trained on features that can be statically extracted from binary executable files offer several promising benefits, such as the ability to detect previously unseen malware and to be re-trained and adapted as threats evolve. Nevertheless, many academic studies of machine learning-based malware detection evaluate performance on datasets that do not evolve with time, even though it is recognized in practice that the performance of malware detection models will necessarily deteriorate over time as novel malware threats emerge. In this study, we use a large dataset comprising features extracted from malware and goodware executable samples in the very common Portable Executable (PE) format, orderable by time of first appearance, to analyze how machine learning-based malware detection models deteriorate over time after training. Of the large number of models we trained and then evaluated on later-occurring subsets of the dataset, we note the relative strength of Random Forest in maintaining predictive performance into the future. We then consider Random Forest-based models for malware detection in greater depth, examining Random Forest hyperparameter choices that better maintain performance, and discuss the significance of the findings for PE malware detection approaches.
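The evaluation protocol described above (train on earlier samples, then score on later-occurring subsets) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the sample layout, the monthly bucketing, the helper names `temporal_split`, `fit_threshold` and `accuracy_by_period`, and the tiny synthetic dataset. A one-feature threshold rule stands in for the classifier so that the example runs with the standard library only; it is not a Random Forest.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical sample layout: (first_seen, features, label),
# with label 1 = malware and 0 = goodware.
Sample = Tuple[date, Sequence[float], int]
Predictor = Callable[[Sequence[float]], int]


def temporal_split(
    samples: List[Sample], cutoff: date
) -> Tuple[List[Sample], Dict[str, List[Sample]]]:
    """Train on samples first seen before `cutoff`; bucket later ones by month."""
    train = [s for s in samples if s[0] < cutoff]
    later: Dict[str, List[Sample]] = defaultdict(list)
    for s in sorted((s for s in samples if s[0] >= cutoff), key=lambda s: s[0]):
        later[s[0].strftime("%Y-%m")].append(s)
    return train, dict(later)


def fit_threshold(train: List[Sample]) -> Predictor:
    """Stand-in classifier: threshold the first feature at the midpoint
    between the malware mean and the goodware mean of the training set."""
    malware = [x[0] for _, x, y in train if y == 1]
    goodware = [x[0] for _, x, y in train if y == 0]
    midpoint = (sum(malware) / len(malware) + sum(goodware) / len(goodware)) / 2
    return lambda x: 1 if x[0] > midpoint else 0


def accuracy_by_period(
    predict: Predictor, later: Dict[str, List[Sample]]
) -> Dict[str, float]:
    """Accuracy on each later period, in chronological order."""
    scores: Dict[str, float] = {}
    for period in sorted(later):
        bucket = later[period]
        correct = sum(1 for _, x, y in bucket if predict(x) == y)
        scores[period] = correct / len(bucket)
    return scores


# Synthetic toy data (not real malware features): the malware feature
# value drifts downward in the last month to mimic a novel threat.
samples: List[Sample] = [
    (date(2018, 1, 5), [0.9], 1),
    (date(2018, 1, 9), [0.1], 0),
    (date(2018, 2, 3), [0.8], 1),
    (date(2018, 2, 7), [0.2], 0),
    (date(2018, 3, 1), [0.4], 1),
    (date(2018, 3, 2), [0.3], 0),
]

train, later = temporal_split(samples, cutoff=date(2018, 2, 1))
scores = accuracy_by_period(fit_threshold(train), later)
print(scores)
```

The point of the sketch is the split by time of first appearance rather than a random shuffle: a model is never scored on samples older than its training data, so a drop in the per-period scores reflects drift in the data. To experiment with an actual Random Forest, one could replace `fit_threshold` with a wrapper around a fitted `sklearn.ensemble.RandomForestClassifier`, assuming scikit-learn is available.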