An Autonomous Labeling Pipeline for Intrusion Detection on Enterprise Networks

Ravi K U Rakesh, Boda Ye, D. Roden, Catherine Beazley, Karan Gadiya, Brendan Abraham, Donald E. Brown, M. Veeraraghavan
{"title":"An Autonomous Labeling Pipeline for Intrusion Detection on Enterprise Networks","authors":"Ravi K U Rakesh, Boda Ye, D. Roden, Catherine Beazley, Karan Gadiya, Brendan Abraham, Donald E. Brown, M. Veeraraghavan","doi":"10.1109/SIEDS.2019.8735629","DOIUrl":null,"url":null,"abstract":"The volume of cyberattacks has grown exponentially over the last half-decade and shows no signs of slowing down. Additionally, attacks are rapidly evolving and are becoming increasingly more sophisticated. Cyber companies and academics alike have turned to machine learning to build models that learn data-driven rules for threat detection. However, these methods require a substantial amount of training data, and many enterprises lack the infrastructure to label their own network traffic for supervised learning. An added complexity to the labeling problem is that IP addresses are frequently reassigned to new hosts. In this paper, we lay a foundation for an autonomous traffic labeling pipeline that incorporates three different sources of ground truth and requires minimal manual intervention. We apply the labeling pipeline to network traffic data acquired from the University of Virginia. We process the network traffic with a popular network monitoring framework called Zeek, which provides aggregated statistics about the packets exchanged between a source and destination over a certain time interval. Additionally, the labeling pipeline synthesizes data from a network of honeypots compiled by the Duke STINGAR project, a series of nine blacklists, and a whitelist called Cisco Umbrella. We show, using cluster, port, and IP-location analyses, that a labeling methodology that ensembles the different data sources is better than one using only the individual sources. 
The labeling methodology proposed in the paper will aid enterprise network administrators in building robust intrusion detection systems.","PeriodicalId":265421,"journal":{"name":"2019 Systems and Information Engineering Design Symposium (SIEDS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Systems and Information Engineering Design Symposium (SIEDS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIEDS.2019.8735629","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 1

Abstract

The volume of cyberattacks has grown exponentially over the last half-decade and shows no sign of slowing. Attacks are also evolving rapidly and becoming increasingly sophisticated. Cybersecurity companies and academics alike have turned to machine learning to build models that learn data-driven rules for threat detection. However, these methods require a substantial amount of training data, and many enterprises lack the infrastructure to label their own network traffic for supervised learning. The labeling problem is further complicated by the frequent reassignment of IP addresses to new hosts. In this paper, we lay a foundation for an autonomous traffic labeling pipeline that incorporates three different sources of ground truth and requires minimal manual intervention. We apply the labeling pipeline to network traffic data acquired from the University of Virginia. We process the network traffic with Zeek, a popular network monitoring framework that provides aggregated statistics about the packets exchanged between a source and destination over a certain time interval. The labeling pipeline also synthesizes data from a network of honeypots compiled by the Duke STINGAR project, a series of nine blacklists, and a whitelist called Cisco Umbrella. We show, using cluster, port, and IP-location analyses, that a labeling methodology that ensembles the different data sources is better than one that uses any individual source alone. The labeling methodology proposed in the paper will aid enterprise network administrators in building robust intrusion detection systems.
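
To make the idea of ensembling the three sources concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not state the actual combination rule, so the function names (`label_connection`, `seen_by_honeypot`), the one-day `window`, and the "unknown on conflict" rule are all assumptions made for illustration. Only the Zeek conn.log field names (`ts`, `id.orig_h`, `id.resp_h`) are real. The whitelist is assumed to have been resolved to IP addresses beforehand, and honeypot sightings are matched within a time window as one possible way to cope with IP reassignment.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Conn:
    """A few fields of a Zeek conn.log record."""
    ts: float      # connection start time, epoch seconds (Zeek "ts")
    orig_h: str    # originator IP (Zeek "id.orig_h")
    resp_h: str    # responder IP (Zeek "id.resp_h")


def seen_by_honeypot(ip: str, ts: float,
                     sightings: Dict[str, List[float]],
                     window: float) -> bool:
    """True if a honeypot logged `ip` within `window` seconds of `ts`.

    The time window is an assumption: it keeps an old sighting from
    tainting a host that later received the same IP address.
    """
    return any(abs(ts - seen) <= window for seen in sightings.get(ip, []))


def label_connection(conn: Conn,
                     sightings: Dict[str, List[float]],
                     blacklist: Set[str],
                     whitelist: Set[str],
                     window: float = 86400.0) -> str:
    """Combine the three sources into one label (hypothetical rule)."""
    endpoints = (conn.orig_h, conn.resp_h)
    malicious = any(
        ip in blacklist or seen_by_honeypot(ip, conn.ts, sightings, window)
        for ip in endpoints
    )
    benign = any(ip in whitelist for ip in endpoints)
    if malicious and not benign:
        return "malicious"
    if benign and not malicious:
        return "benign"
    return "unknown"  # no evidence, or the sources disagree


if __name__ == "__main__":
    # Made-up example data using documentation-reserved IP ranges.
    sightings = {"203.0.113.7": [1_000_000.0]}
    blacklist = {"198.51.100.9"}
    whitelist = {"192.0.2.10"}
    for conn in (
        Conn(1_000_500.0, "203.0.113.7", "10.0.0.5"),
        Conn(2_000_000.0, "203.0.113.7", "10.0.0.5"),
        Conn(1_000_500.0, "10.0.0.5", "192.0.2.10"),
    ):
        print(conn.orig_h, "->", conn.resp_h,
              label_connection(conn, sightings, blacklist, whitelist))
```

Returning "unknown" when the sources conflict is one conservative design choice; a real pipeline might instead weight the sources or defer such records to manual review.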