Internet traffic classification demystified: on the sources of the discriminative power

Proceedings of The 6th International Conference on Innovation in Science and Technology. Pub Date: 2010-11-30. DOI: 10.1145/1921168.1921180

Yeon-sup Lim, Hyunchul Kim, Jiwoong Jeong, Chong-kwon Kim, T. Kwon, Yanghee Choi


Citations: 154

Abstract

Recent research on Internet traffic classification has yielded a number of data mining techniques for distinguishing types of traffic, but no systematic analysis of why some algorithms achieve high accuracy. In pursuit of empirically grounded answers to this "why" question, which is critical to understanding and establishing a scientific foundation for traffic classification research, this paper reveals the three sources of discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one-two (for UDP flows) or four-five (for TCP flows) packets, and (iii) discretization of those features. We find that C4.5 performs the best under any circumstances, and we identify the reason why: the algorithm discretizes input features during classification operations. We also find that entropy-based Minimum Description Length discretization of the port and packet size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much as 59.8%!) and makes all of them achieve >93% accuracy on average without any algorithm-specific tuning. Our results indicate that treating the port and packet size features as discrete nominal intervals, not as continuous numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless of which classification algorithm is used.
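To make the discretization step concrete, here is a minimal sketch of entropy-based MDL discretization applied to a single packet-size feature. It assumes the Fayyad-Irani stopping criterion, which the abstract does not name explicitly, and it is not the authors' implementation. The function names (`mdl_cut_points`, `discretize`) and the sample flows are hypothetical and chosen only for illustration.

```python
import bisect
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdl_cut_points(values, labels):
    """Return sorted cut points chosen by recursive entropy-based splitting,
    accepting a split only if it passes the Fayyad-Irani MDL test (assumed)."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    cuts = []

    def split(lo, hi):
        n = hi - lo
        seg = ys[lo:hi]
        ent = entropy(seg)
        if n < 2 or ent == 0:
            return  # nothing to split, or the segment is already pure
        best = None
        for i in range(lo + 1, hi):
            if xs[i] == xs[i - 1]:
                continue  # a cut must fall between two distinct values
            e1, e2 = entropy(ys[lo:i]), entropy(ys[i:hi])
            gain = ent - ((i - lo) / n) * e1 - ((hi - i) / n) * e2
            if best is None or gain > best[0]:
                best = (gain, i, e1, e2)
        if best is None:
            return
        gain, i, e1, e2 = best
        k, k1, k2 = len(set(seg)), len(set(ys[lo:i])), len(set(ys[i:hi]))
        delta = math.log2(3 ** k - 2) - (k * ent - k1 * e1 - k2 * e2)
        # MDL acceptance test: the gain must outweigh the cost of encoding the cut.
        if gain > (math.log2(n - 1) + delta) / n:
            cuts.append((xs[i - 1] + xs[i]) / 2)
            split(lo, i)
            split(i, hi)

    split(0, len(xs))
    return sorted(cuts)


def discretize(value, cuts):
    """Map a continuous value to the index of its nominal interval."""
    return bisect.bisect_right(cuts, value)


# Made-up first-packet sizes (bytes) for two hypothetical application classes.
sizes = list(range(60, 80, 2)) + list(range(1400, 1420, 2))
apps = ["dns"] * 10 + ["bulk"] * 10
cuts = mdl_cut_points(sizes, apps)
bins = [discretize(s, cuts) for s in sizes]
```

After this step a classifier would see the interval index in `bins` rather than the raw byte count, which is the "discrete nominal intervals" treatment the abstract argues for. The same procedure would be applied to port numbers and to each packet-size feature independently.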

