On the Effectiveness of Extracting Important Words from Proxy Logs

M. Mimura
DOI: 10.1109/CANDARW.2018.00084
Published in: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)
Publication date: 2018-11-01
Citations: 1

Abstract

Modern HTTP-based malware imitates benign traffic to evade detection. To detect unseen malicious traffic, many methods using machine learning techniques have been proposed. These methods exploit characteristics of malicious traffic and usually require additional parameters that cannot be obtained from essential security devices such as a proxy server or an IDS (Intrusion Detection System). Thus, most previous methods are not applicable to actual information systems. To tackle this realistic threat, a linguistics-based detection method for proxy logs has been proposed. This method automatically extracts words as feature vectors using natural language processing techniques and discriminates between benign and malicious traffic. The previous method generates a corpus from all extracted words, which include trivial words. To generate a discriminative feature representation, the corpus has to be summarized effectively. This paper extracts important words from proxy logs to summarize the corpus. To define the word importance score, this paper uses term frequency and document frequency. Our method summarizes the corpus and improves the detection rate. We conducted cross-validation and timeline analysis with pcap files captured from Exploit Kits (EKs) between 2014 and 2016. The experimental results show that our method improves the accuracy. The best F-measure reaches 1.00 in both the cross-validation and the timeline analysis.
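The abstract names term frequency and document frequency as the ingredients of the word importance score, but does not give the exact formula. The sketch below uses a standard TF-IDF-style weighting as a stand-in: proxy-log lines are split into word-like tokens, each word is scored by its corpus-wide term frequency times the inverse of its document frequency, and only the top-scoring words are kept as the summarized vocabulary. The tokenizer, the sample log format, and the function names (`tokenize`, `important_words`) are illustrative assumptions, not the paper's implementation.

```python
import math
import re
from collections import Counter

def tokenize(log_line):
    # Split a proxy-log line into word-like tokens; URL delimiters
    # such as /, ?, &, =, and . all act as separators.
    return [t for t in re.split(r"[^A-Za-z0-9]+", log_line.lower()) if t]

def important_words(docs, top_k=100):
    """Rank words by term frequency x inverse document frequency and
    keep the top_k as a summarized vocabulary.

    This is a sketch: the paper's actual importance score also builds
    on term frequency and document frequency, but its precise form is
    not reproduced here.
    """
    n_docs = len(docs)
    tf = Counter()   # corpus-wide term counts
    df = Counter()   # number of documents containing each term
    for doc in docs:
        tokens = tokenize(doc)
        tf.update(tokens)
        df.update(set(tokens))
    # Words appearing in every log line (e.g. "GET", "http") get an
    # IDF of zero and thus fall to the bottom of the ranking.
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]
```

A classifier along the lines of the paper would then represent each log line as a feature vector over this reduced vocabulary instead of over every extracted word, which is what "summarizing the corpus" buys: trivial tokens shared by benign and malicious traffic no longer dilute the feature space.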