Internet traffic classification demystified: on the sources of the discriminative power

Yeon-sup Lim, Hyunchul Kim, Jiwoong Jeong, Chong-kwon Kim, T. Kwon, Yanghee Choi
DOI: 10.1145/1921168.1921180
URL: https://doi.org/10.1145/1921168.1921180
Published: 2010-11-30
Source: Proceedings of The 6th International Conference on Innovation in Science and Technology
Citations: 154

Abstract

Recent research on Internet traffic classification has yielded a number of data mining techniques for distinguishing types of traffic, but no systematic analysis of why some algorithms achieve high accuracy. In pursuit of empirically grounded answers to this "why" question, which is critical to understanding traffic classification and establishing a scientific basis for the research, this paper reveals three sources of discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one or two packets (for UDP flows) or the first four or five packets (for TCP flows), and (iii) discretization of those features. We find that C4.5 performs best in all circumstances, and we identify the reason: the algorithm discretizes input features during classification. We also find that entropy-based Minimum Description Length (MDL) discretization of the port and packet size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much as 59.8%) and lets all of them achieve >93% accuracy on average without any algorithm-specific tuning. Our results indicate that treating the port and packet size features as discrete nominal intervals, not as continuous numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless of the classification algorithm used.
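To make the discretization step concrete, here is a minimal Python sketch of entropy-based MDL discretization for a single numeric feature such as a port number or a packet size. It assumes the abstract refers to the commonly used Fayyad-Irani stopping criterion, which the abstract itself does not state. The function names (`mdl_cut_points`, `discretize`) and the toy flows are hypothetical illustrations, not the authors' code or data.

```python
import bisect
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (gain, index) of the highest-gain boundary in sorted data, or None."""
    n = len(values)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # a cut is only possible between distinct values
        gain = (base
                - (i / n) * entropy(labels[:i])
                - ((n - i) / n) * entropy(labels[i:]))
        if best is None or gain > best[0]:
            best = (gain, i)
    return best


def mdl_accepts(labels, i, gain):
    """Fayyad-Irani MDL test (assumed criterion): is the cut worth its coding cost?"""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n


def mdl_cut_points(values, labels):
    """Recursively split the feature range; return the sorted accepted cut points."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]
    cuts = []

    def recurse(lo, hi):
        sub_v, sub_l = vs[lo:hi], ls[lo:hi]
        if len(sub_v) < 2:
            return
        found = best_cut(sub_v, sub_l)
        if found is None:
            return
        gain, i = found
        if not mdl_accepts(sub_l, i, gain):
            return
        cuts.append((sub_v[i - 1] + sub_v[i]) / 2)  # midpoint between neighbours
        recurse(lo, lo + i)
        recurse(lo + i, hi)

    recurse(0, len(vs))
    return sorted(cuts)


def discretize(value, cuts):
    """Map a numeric value to the index of its nominal interval."""
    return bisect.bisect_right(cuts, value)


# Invented toy flows: (server port, application label).
ports = [53] * 20 + [80] * 20 + [6881] * 20
apps = ["dns"] * 20 + ["web"] * 20 + ["p2p"] * 20
cuts = mdl_cut_points(ports, apps)
```

On this toy data the accepted cut points separate the three port groups, so `discretize(80, cuts)` returns a small interval index that a classifier can treat as a nominal category rather than as a magnitude. A classifier that reads 6881 as "far larger than 80" sees a distance that carries no meaning for ports, which is the intuition behind discretizing first.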