Authors: T. Tran, Hiroshi Sato
Venue: 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES)
Published: 2017-11-01
DOI: 10.1109/IESYS.2017.8233569
NLP-based approaches for malware classification from API sequences
Malware analysis relies on two basic approaches, static analysis and dynamic analysis, to understand how a particular piece of malware functions. Dynamic analysis lets researchers collect API call sequences, which are a valuable source of information for identifying malware behavior. The malware classification procedures proposed in this paper use API call sequences as inputs to classifiers. Drawing on developments in Natural Language Processing, we use methods such as n-grams, doc2vec (Paragraph Vectors), and TF-IDF to convert these API sequences into numeric vectors before feeding them to the classifiers. We propose three methods for classifying malware: TF-IDF, Paragraph Vector with Distributed Bag of Words, and Paragraph Vector with Distributed Memory. Each of them achieves very good accuracy.
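To make the vectorization step concrete, the following is a minimal sketch of turning API call sequences into TF-IDF vectors over n-grams. It is not the authors' implementation: the function names `api_ngrams` and `tfidf_vectors`, the choice of bigrams, the textbook weighting (relative term frequency times log(N / document frequency)), and the sample API call names are all assumptions made for illustration.

```python
import math
from collections import Counter


def api_ngrams(calls, n=2):
    """Return the n-grams (as tuples) of one API call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def tfidf_vectors(sequences, n=2):
    """Map each API call sequence to a TF-IDF vector over its n-grams.

    Returns (vocabulary, vectors), where vectors[i][j] is the weight of
    vocabulary[j] in sequences[i].
    """
    # One n-gram count table per sample (a sample plays the role of a document).
    docs = [Counter(api_ngrams(seq, n)) for seq in sequences]
    vocabulary = sorted({gram for doc in docs for gram in doc})
    n_docs = len(docs)
    # Document frequency: in how many samples each n-gram appears at least once.
    df = Counter(gram for doc in docs for gram in doc)

    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1  # guard against sequences shorter than n
        vectors.append([
            (doc[gram] / total) * math.log(n_docs / df[gram])
            for gram in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical traces; real ones would come from a dynamic-analysis sandbox.
sequences = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
]
vocabulary, vectors = tfidf_vectors(sequences, n=2)
```

Each row of `vectors` can then be passed to any standard classifier. Under this weighting, an n-gram that occurs in every sample gets weight zero, so only n-grams that distinguish samples contribute to the vector.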