On the Effectiveness of Extracting Important Words from Proxy Logs

M. Mimura
DOI: 10.1109/CANDARW.2018.00084
Published in: 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW)
Publication date: 2018-11-01
Citations: 1

Abstract

Modern HTTP-based malware imitates benign traffic to evade detection. To detect unseen malicious traffic, many methods using machine learning techniques have been proposed. These methods exploit characteristics of malicious traffic and usually require additional parameters that cannot be obtained from essential security devices such as a proxy server or an IDS (Intrusion Detection System). Thus, most previous methods are not applicable to actual information systems. To tackle this realistic threat, a linguistics-based detection method for proxy logs has been proposed. This method automatically extracts words as feature vectors using natural language processing techniques and discriminates between benign and malicious traffic. The previous method generates a corpus from all extracted words, which include trivial words. To generate a discriminative feature representation, the corpus has to be summarized effectively. This paper extracts important words from proxy logs to summarize the corpus. To define the word importance score, this paper uses term frequency and document frequency. Our method summarizes the corpus and improves the detection rate. We conducted cross-validation and timeline analysis with pcap files captured from Exploit Kits (EKs) between 2014 and 2016. The experimental results show that our method improves the accuracy. The best F-measure reaches 1.00 in both the cross-validation and the timeline analysis.
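The abstract names term frequency and document frequency as the ingredients of the word importance score, but does not give the exact formula. The sketch below uses a standard TF-IDF-style weighting as a stand-in: proxy-log lines are split into word-like tokens, each word is scored by its corpus-wide term frequency times the inverse of its document frequency, and only the top-scoring words are kept as the summarized vocabulary. The tokenizer, the sample log format, and the function names (`tokenize`, `important_words`) are illustrative assumptions, not the paper's implementation.

```python
import math
import re
from collections import Counter

def tokenize(log_line):
    # Split a proxy-log line into word-like tokens; URL delimiters
    # such as /, ?, &, =, and . all act as separators.
    return [t for t in re.split(r"[^A-Za-z0-9]+", log_line.lower()) if t]

def important_words(docs, top_k=100):
    """Rank words by term frequency x inverse document frequency and
    keep the top_k as a summarized vocabulary.

    This is a sketch: the paper's actual importance score also builds
    on term frequency and document frequency, but its precise form is
    not reproduced here.
    """
    n_docs = len(docs)
    tf = Counter()   # corpus-wide term counts
    df = Counter()   # number of documents containing each term
    for doc in docs:
        tokens = tokenize(doc)
        tf.update(tokens)
        df.update(set(tokens))
    # Words appearing in every log line (e.g. "GET", "http") get an
    # IDF of zero and thus fall to the bottom of the ranking.
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]
```

A classifier along the lines of the paper would then represent each log line as a feature vector over this reduced vocabulary instead of over every extracted word, which is what "summarizing the corpus" buys: trivial tokens shared by benign and malicious traffic no longer dilute the feature space.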