{"title":"Understanding and Mitigating Label Bias in Malware Classification: An Empirical Study","authors":"Jia Yan, Xiangkun Jia, Lingyun Ying, Purui Su","doi":"10.1109/QRS57517.2022.00057","DOIUrl":null,"url":null,"abstract":"Machine learning techniques are promising for malware classification, but there is a neglected problem of label bias in the annotation process which decreases the performance in practice. To understand the label bias problems and existing solutions, we conduct an empirical study based on two Portable Executable (PE) malware sample datasets (i.e., open-sourced BODMAS with 52,793 samples and a new collected MAIN dataset of 153,811 samples), and 67 anti-virus engines in VirusTotal. We first show the two ways of label bias problems, including chaotic naming rules and annotation inconsistency. Then we present the effects of two solutions (i.e., electing one reputable AV engine and aggregating multiple labels based on majority voting) and find they face the problems of feature preference and engine independence. Finally, we propose some recommendations for improvements and get a 7.79% increase in the F1 score (i.e., from 84.83% to 92.62%). The dataset will be open-source for further study.","PeriodicalId":143812,"journal":{"name":"2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/QRS57517.2022.00057","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Machine learning techniques are promising for malware classification, but label bias introduced during the annotation process is a neglected problem that degrades performance in practice. To understand label bias and existing solutions, we conduct an empirical study based on two Portable Executable (PE) malware datasets (the open-source BODMAS dataset of 52,793 samples and a newly collected MAIN dataset of 153,811 samples) and 67 anti-virus engines on VirusTotal. We first show two manifestations of label bias: chaotic naming rules and annotation inconsistency. We then evaluate two common solutions (selecting one reputable AV engine, and aggregating multiple labels by majority voting) and find that they face the problems of feature preference and a lack of engine independence, respectively. Finally, we propose recommendations for improvement and achieve a 7.79-percentage-point increase in F1 score (from 84.83% to 92.62%). The dataset will be open-sourced for further study.
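The majority-voting aggregation mentioned above can be sketched as follows. This is a minimal illustrative example, not the paper's actual pipeline: the engine names and family labels are hypothetical, and the real study aggregates verdicts from 67 VirusTotal engines with its own normalization of naming rules.

```python
from collections import Counter


def majority_vote(labels):
    """Aggregate per-engine family labels by majority voting.

    `labels` maps an AV engine name to the family label it assigned,
    or None if the engine produced no detection. Returns the most
    frequent non-None label, or None if no engine voted.
    """
    votes = Counter(label for label in labels.values() if label is not None)
    if not votes:
        return None
    winner, _count = votes.most_common(1)[0]
    return winner


# Hypothetical verdicts from three engines for one sample:
verdicts = {"EngineA": "emotet", "EngineB": "emotet", "EngineC": "zbot"}
print(majority_vote(verdicts))  # -> emotet
```

Note that this scheme implicitly assumes the engines vote independently; as the paper observes, correlated engines (e.g., ones sharing signatures) can dominate the vote and reintroduce bias.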