{"title":"分析机器学习大数据集的新技术&对Android移动恶意软件数据集的简要回顾","authors":"Gürol Canbek, Ş. Sağiroğlu, Tugba Taskaya Temizel","doi":"10.1109/IBIGDELFT.2018.8625275","DOIUrl":null,"url":null,"abstract":"As the volume, variety, velocity aspects of big data are increasing, the other aspects such as veracity, value, variability, and venue could not be interpreted easily by data owners or researchers. The aspects are also unclear if the data is to be used in machine learning studies such as classification or clustering. This study proposes four techniques with fourteen criteria to systematically profile the datasets collected from different resources to distinguish from one another and see their strong and weak aspects. The proposed approach is demonstrated in five Android mobile malware datasets in the literature and in security industry namely Android Malware Genome Project, Drebin, Android Malware Dataset, Android Botnet, and Virus Total 2018. The results have shown that the proposed profiling methods reveal remarkable insight about the datasets comparatively and directs researchers to achieve big but more visible, qualitative, and internalized datasets.","PeriodicalId":290302,"journal":{"name":"2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"New Techniques in Profiling Big Datasets for Machine Learning with a Concise Review of Android Mobile Malware Datasets\",\"authors\":\"Gürol Canbek, Ş. Sağiroğlu, Tugba Taskaya Temizel\",\"doi\":\"10.1109/IBIGDELFT.2018.8625275\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As the volume, variety, velocity aspects of big data are increasing, the other aspects such as veracity, value, variability, and venue could not be interpreted easily by data owners or researchers. The aspects are also unclear if the data is to be used in machine learning studies such as classification or clustering. This study proposes four techniques with fourteen criteria to systematically profile the datasets collected from different resources to distinguish from one another and see their strong and weak aspects. The proposed approach is demonstrated in five Android mobile malware datasets in the literature and in security industry namely Android Malware Genome Project, Drebin, Android Malware Dataset, Android Botnet, and Virus Total 2018. The results have shown that the proposed profiling methods reveal remarkable insight about the datasets comparatively and directs researchers to achieve big but more visible, qualitative, and internalized datasets.\",\"PeriodicalId\":290302,\"journal\":{\"name\":\"2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT)\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IBIGDELFT.2018.8625275\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IBIGDELFT.2018.8625275","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
New Techniques in Profiling Big Datasets for Machine Learning with a Concise Review of Android Mobile Malware Datasets
As the volume, variety, velocity aspects of big data are increasing, the other aspects such as veracity, value, variability, and venue could not be interpreted easily by data owners or researchers. The aspects are also unclear if the data is to be used in machine learning studies such as classification or clustering. This study proposes four techniques with fourteen criteria to systematically profile the datasets collected from different resources to distinguish from one another and see their strong and weak aspects. The proposed approach is demonstrated in five Android mobile malware datasets in the literature and in security industry namely Android Malware Genome Project, Drebin, Android Malware Dataset, Android Botnet, and Virus Total 2018. The results have shown that the proposed profiling methods reveal remarkable insight about the datasets comparatively and directs researchers to achieve big but more visible, qualitative, and internalized datasets.