ANALYSIS OF THE INFLUENCE OF MACHINE LEARNING ALGORITHM PARAMETERS ON THE RESULTS OF TRAFFIC CLASSIFICATION IN REAL TIME

T-Comm. Pub Date: 1900-01-01. DOI: 10.36724/2072-8735-2021-15-9-24-35

I. Krasnova



Abstract

The paper analyzes how the parameter settings of Machine Learning algorithms affect the results of real-time traffic classification. Two algorithms are considered: Random Forest and XGBoost. Both methods are briefly described, along with the methods used to evaluate classification results. Experiments are conducted on a dataset collected on a real network, separately for TCP and UDP flows. So that the results can be applied in real time, a special feature matrix is built from the first 15 packets of each flow. For Random Forest (RF), the main tuned parameters are the number of trees, the split criterion, the maximum number of features considered when constructing a split, the tree depth, and the minimum number of samples in a node and in a leaf. For XGBoost, the tuned parameters are the number of trees, the tree depth, the minimum number of samples in a leaf, and the percentage of features and of samples used to build each tree. Increasing the number of trees raises accuracy up to a certain value, but, as the article shows, it is important to make sure that the model is not overfitted. The remaining tree parameters are used to combat overfitting. On the dataset under study, eliminating overfitting increased classification accuracy for individual applications by 11-12% for Random Forest and by 12-19% for XGBoost. The results show that parameter tuning is a very important step in building a traffic classification model, because it helps to combat overfitting and significantly increases the accuracy of the algorithm's predictions. In addition, the paper shows that with properly configured parameters XGBoost, which is rarely used in traffic classification studies, becomes a competitive algorithm and outperforms the widely used Random Forest.
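The setup described in the abstract can be illustrated with a minimal Python sketch. It makes several assumptions that are not in the paper: the per-flow statistics in `flow_features` are illustrative (the abstract does not list the actual features), the grid values are hypothetical, and the parameter names are those used by scikit-learn's `RandomForestClassifier` and by XGBoost's scikit-learn interface, which the paper may or may not have used. In XGBoost, `min_child_weight` is only the closest analogue of a minimum leaf size, not an exact sample count.

```python
from itertools import product
from statistics import mean, pstdev

N_PACKETS = 15  # the abstract builds features from the first 15 packets of a flow


def flow_features(packet_sizes, arrival_times, n_packets=N_PACKETS):
    """Summarize the first n_packets of one flow as a fixed-length feature row.

    The statistics chosen here are illustrative assumptions, not the paper's features.
    """
    sizes = list(packet_sizes[:n_packets])
    times = list(arrival_times[:n_packets])
    if not sizes or len(sizes) != len(times):
        raise ValueError("need equal-length, non-empty size and time sequences")
    # Inter-arrival gaps; a single-packet flow gets one zero gap.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return [
        len(sizes),
        min(sizes), max(sizes), mean(sizes), pstdev(sizes),
        min(gaps), max(gaps), mean(gaps), pstdev(gaps),
    ]


# Random Forest parameters named in the abstract (values are hypothetical).
RF_GRID = {
    "n_estimators": [50, 100, 200],      # number of trees
    "criterion": ["gini", "entropy"],    # split criterion
    "max_features": ["sqrt", "log2"],    # features considered per split
    "max_depth": [5, 10, None],          # tree depth (None = unlimited)
    "min_samples_split": [2, 10],        # minimum samples in a node
    "min_samples_leaf": [1, 5],          # minimum samples in a leaf
}

# XGBoost parameters named in the abstract (values are hypothetical).
XGB_GRID = {
    "n_estimators": [50, 100, 200],      # number of trees
    "max_depth": [3, 6],                 # tree depth
    "min_child_weight": [1, 5],          # closest analogue of minimum leaf size
    "colsample_bytree": [0.5, 1.0],      # fraction of features per tree
    "subsample": [0.5, 1.0],             # fraction of samples per tree
}


def expand_grid(grid):
    """Yield every parameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Each dict yielded by `expand_grid` could be passed as keyword arguments to the corresponding classifier and scored on held-out flows, comparing training and validation accuracy to detect the overfitting the abstract warns about.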
