Authors: Ortal Dayan, L. Wolf, Fang Wang, Yaniv Harel
Venue: 2023 IEEE International Conference on Cyber Security and Resilience (CSR)
Published: 2023-07-31
DOI: 10.1109/CSR57506.2023.10224927 (https://doi.org/10.1109/CSR57506.2023.10224927)
Optimizing AI for Mobile Malware Detection by Self-Built-Dataset GAN Oversampling and LGBM
The cyber detection industry focuses on analyzing threat behavior in order to develop indicators of compromise (IOCs) and triggers. This process keeps detection permanently behind the attackers, because analysis takes time between the launch of an attack tool and the ability to detect it. To address this challenge, a dedicated sandbox environment was built and thousands of mobile-device samples were tested, resulting in an up-to-date training dataset that does not depend on analysis of the attacks. With this dataset, the research focused on optimizing the AI methodology to achieve the best detection rates for compromised mobile devices. A CopulaGAN was implemented to oversample the dataset, and LGBM models trained on the original imbalanced dataset were compared with models trained on the oversampled dataset. Classification scores on the oversampled data increase by at most 0.47 ± 0.37%. The model fine-tuned with Optuna on the balanced data reaches 99.36 ± 0.19% accuracy.
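The class-balancing step and the "mean ± standard deviation" reporting used in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline. The GAN-based oversampler is replaced here by plain random duplication of minority rows, and no LGBM model or Optuna search is run. The function names, the toy rows, and the per-fold scores are all hypothetical and serve only to show the mechanics.

```python
import random
import statistics
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Balance classes by resampling minority rows with replacement.

    Stand-in for a generative oversampler: it duplicates existing rows
    rather than synthesizing new ones.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):  # no-op for the majority class
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def mean_std(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical toy data: 8 benign (0) and 2 compromised (1) samples.
rows = [[i, i % 3] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = oversample_minority(rows, labels)
class_counts = Counter(balanced_labels)

# Hypothetical per-fold accuracies, only to show the "mean +/- std" format.
fold_scores = [0.99, 0.98, 1.00]
mean, std = mean_std(fold_scores)
print(dict(class_counts), f"{mean:.2%} +/- {std:.2%}")
```

When a comparison like this is run on real data, oversampling is normally applied to the training split only, so that duplicated or synthetic rows never appear in the evaluation data.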