{"title":"从代理日志中提取重要词的有效性研究","authors":"M. Mimura","doi":"10.1109/CANDARW.2018.00084","DOIUrl":null,"url":null,"abstract":"Modern http-based malware imitates benign traffic to evade detection. To detect unseen malicious traffic, many methods using machine learning techniques have been proposed. These methods took advantage of the characteristic of malicious traffic, and usually require additional parameters which are not obtained from essential security devices such as a proxy server or IDS (Intrusion Detection System). Thus, most previous methods are not applicable to actual information systems. To tackle a realistic threat, a linguistic-based detection method for proxy logs has been proposed. This method extracts words as feature vectors automatically with natural language techniques, and discriminates between benign traffic and malicious traffic. The previous method generates a corpus from the whole extracted words which contain trivial words. To generate discriminative feature representation, a corpus has to be effectively summarized. This paper extracts important words from proxy logs to summarize the corpus. To define the word importance score, this paper uses term frequency and document frequency. Our method summarizes the corpus and improves the detection rate. We conducted cross-validation and timeline analysis with captured pcap files from Exploit Kit (EK) between 2014 and 2016. The experimental result shows that our method improves the accuracy. The best F-measure achieves 1.00 in the cross-validation and timeline analysis.","PeriodicalId":329439,"journal":{"name":"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"On the Effectiveness of Extracting Important Words from Proxy Logs\",\"authors\":\"M. 
Mimura\",\"doi\":\"10.1109/CANDARW.2018.00084\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern http-based malware imitates benign traffic to evade detection. To detect unseen malicious traffic, many methods using machine learning techniques have been proposed. These methods took advantage of the characteristic of malicious traffic, and usually require additional parameters which are not obtained from essential security devices such as a proxy server or IDS (Intrusion Detection System). Thus, most previous methods are not applicable to actual information systems. To tackle a realistic threat, a linguistic-based detection method for proxy logs has been proposed. This method extracts words as feature vectors automatically with natural language techniques, and discriminates between benign traffic and malicious traffic. The previous method generates a corpus from the whole extracted words which contain trivial words. To generate discriminative feature representation, a corpus has to be effectively summarized. This paper extracts important words from proxy logs to summarize the corpus. To define the word importance score, this paper uses term frequency and document frequency. Our method summarizes the corpus and improves the detection rate. We conducted cross-validation and timeline analysis with captured pcap files from Exploit Kit (EK) between 2014 and 2016. The experimental result shows that our method improves the accuracy. 
The best F-measure achieves 1.00 in the cross-validation and timeline analysis.\",\"PeriodicalId\":329439,\"journal\":{\"name\":\"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CANDARW.2018.00084\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CANDARW.2018.00084","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
On the Effectiveness of Extracting Important Words from Proxy Logs
Modern HTTP-based malware imitates benign traffic to evade detection. Many methods using machine learning techniques have been proposed to detect unseen malicious traffic. These methods exploit characteristics of malicious traffic, but they usually require additional parameters that cannot be obtained from essential security devices such as a proxy server or an IDS (Intrusion Detection System). Thus, most previous methods are not applicable to actual information systems. To tackle this realistic threat, a linguistic detection method for proxy logs has been proposed. This method automatically extracts words as feature vectors with natural language processing techniques and discriminates between benign and malicious traffic. However, the previous method generates a corpus from all extracted words, including trivial ones. To obtain a discriminative feature representation, the corpus has to be summarized effectively. This paper extracts important words from proxy logs to summarize the corpus; to define the word importance score, it uses term frequency and document frequency. Our method summarizes the corpus and improves the detection rate. We conducted cross-validation and timeline analysis with pcap files captured from Exploit Kits (EKs) between 2014 and 2016. The experimental results show that our method improves accuracy; the best F-measure reaches 1.00 in both the cross-validation and the timeline analysis.
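The abstract names two statistics for scoring word importance — term frequency and document frequency — without giving the exact formula. The sketch below illustrates one plausible combination (multiplying the two counts) on tokenized proxy-log lines to keep only high-scoring words; the tokenizer, the `tf * df` scoring rule, and the sample log lines are assumptions for illustration, not the paper's specification.

```python
import re
from collections import Counter

def tokenize(log_line):
    # Split a proxy-log line into lowercase word-like tokens,
    # treating every non-alphanumeric character as a delimiter.
    return [t for t in re.split(r"[^a-z0-9]+", log_line.lower()) if t]

def important_words(log_lines, top_k=5):
    """Return the top_k words ranked by term frequency * document frequency.

    tf counts every occurrence of a word across all lines;
    df counts how many lines contain the word at least once.
    Multiplying them is one plausible importance score built from the
    two statistics the paper names (an assumption, not its exact formula).
    """
    tf = Counter()
    df = Counter()
    for line in log_lines:
        tokens = tokenize(line)
        tf.update(tokens)          # every occurrence
        df.update(set(tokens))     # at most once per line
    scores = {w: tf[w] * df[w] for w in tf}
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]

# Hypothetical proxy-log request lines (illustrative only).
logs = [
    "GET http://example.com/gate.php?id=123",
    "GET http://ads.example.net/banner.png",
    "GET http://evil.example.org/gate.php?id=999",
]
print(important_words(logs, top_k=3))
```

Words appearing in many lines and many times (here the shared URL components) dominate the ranking, while one-off tokens such as random identifiers fall to the bottom, which is the summarization effect the paper aims for.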