Classification of computer viruses from binary code using ensemble classifier and recursive feature elimination
Prasit Usaphapanus, K. Piromsopa
2017 Twelfth International Conference on Digital Information Management (ICDIM), 2017-09-01
DOI: 10.1109/ICDIM.2017.8244670 (https://doi.org/10.1109/ICDIM.2017.8244670)
Citations: 2
Abstract
This paper proposes a supervised machine learning model for detecting previously unseen virus files. Our main focus is on the static analysis approach. To find the best method, we experiment with different types of feature extraction and three classifier algorithms: extreme gradient boosting, random forest, and multilayer perceptron. Our data set contains 6,319 executable files. Features are extracted from each file with objdump and ranked by TF-IDF score to find the best features. The resulting F1 scores are slightly better than those of the baselines. Random forest with 20 attributes yields an F1 score of 0.9379758, which is 0.0316167 higher than that of the baseline. The extreme gradient boosting method with 500 attributes achieves an F1 score of 0.9628991, which is 0.0418642 higher than that of the baseline. We conclude that our approach can improve the precision and recall of the classification.
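
The TF-IDF ranking step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which TF-IDF variant is used or how per-file scores are aggregated, so the unsmoothed `log(N / df)` weighting, the summation across files, the function name `tfidf_rank`, and the toy opcode lists (standing in for tokens parsed from objdump output) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_rank(docs, top_k):
    """Return the top_k tokens ranked by summed TF-IDF score (assumed aggregation)."""
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Sum each token's TF-IDF score over all documents.
    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue
        counts = Counter(tokens)
        for tok, count in counts.items():
            tf = count / len(tokens)           # relative term frequency
            idf = math.log(n_docs / df[tok])   # unsmoothed IDF (assumption)
            scores[tok] += tf * idf

    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:top_k]]


# Hypothetical opcode sequences, one list per executable file.
samples = [
    ["push", "mov", "call", "ret"],
    ["push", "mov", "xor", "xor", "jmp"],
    ["push", "mov", "call", "call", "ret"],
]

top_features = tfidf_rank(samples, top_k=3)
print(top_features)
```

Tokens that occur in every file, such as `push` and `mov` in the toy data, receive an IDF of zero and fall to the bottom of the ranking, which is why a TF-IDF cutoff can shrink the attribute set before a classifier is trained on it.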