Hybrid Approach with Membership-Density Based Oversampling for handling multi-class imbalance in Internet Traffic Identification with overlapping and noise
IF 4.1 | Region 3: Computer Science | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS
Journal: ICT Express, Volume 10, Issue 5, Pages 1094-1102. Publication date: 2024-10-01. DOI: 10.1016/j.icte.2024.04.007. URL: https://www.sciencedirect.com/science/article/pii/S2405959524000444
Citations: 0
Abstract
Internet traffic identification is a crucial method for monitoring Internet application activities and is essential for Internet management and security. Internet traffic data typically has an imbalanced distribution: the number of instances differs markedly from class to class, which is the class imbalance problem. This problem can degrade classification performance because classifiers generally assume a balanced class distribution. Internet traffic identification datasets are also often affected by class overlap and noise. A hybrid approach that combines data-level and ensemble-based techniques is usually chosen to handle class imbalance. At the data level, oversampling with SMOTE is a common choice because it can synthesize new samples for minority classes. However, SMOTE-generated samples tend to be noisy and to overlap with majority-class samples. This research proposes a Hybrid Approach with Membership-Density-Based Oversampling to tackle this challenge. It emphasizes the use of membership degrees to group samples into safe, overlapping, and noisy areas. The top samples are then selected from the safe and overlapping safe areas based on density ratio, stability, and score. The findings show that the proposed method effectively addresses multi-class imbalance in six Internet traffic identification datasets, yielding slightly better average accuracy, F_b measure, and class balance accuracy than the other tested methods, though the difference is not statistically significant. In the noise and overlap scenario experiments, the average accuracy obtained is superior, with a considerable difference compared to all tested methods.
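
The grouping of samples into safe, overlapping, and noisy areas can be illustrated with a minimal sketch. The abstract does not give the paper's membership function, so this example assumes a simple stand-in: the membership degree of a sample is the fraction of its k nearest neighbours that share its class. The function names, the thresholds `safe_thr` and `noise_thr`, and the toy data are all hypothetical, and the density ratio, stability, and score selection step is not modelled. The `interpolate` helper shows only the standard SMOTE-style interpolation between a seed and a same-class neighbour.

```python
import math
import random


def knn_membership(X, y, i, k=5):
    """Assumed membership degree: share of the k nearest neighbours of
    sample i that carry the same label as sample i."""
    dists = sorted(
        (math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i
    )
    neighbours = [j for _, j in dists[:k]]
    return sum(y[j] == y[i] for j in neighbours) / len(neighbours)


def assign_area(membership, safe_thr=0.8, noise_thr=0.2):
    """Map a membership degree to an area (thresholds are illustrative)."""
    if membership >= safe_thr:
        return "safe"
    if membership <= noise_thr:
        return "noisy"
    return "overlapping"


def interpolate(seed, neighbour, rng):
    """SMOTE-style synthetic sample on the segment seed -> neighbour."""
    u = rng.random()
    return [s + u * (n - s) for s, n in zip(seed, neighbour)]


# Toy data: a compact "p2p" cluster, plus one "p2p" point inside "web".
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
     [5, 5], [5, 5.1], [5.1, 5], [4.9, 5], [5, 4.9]]
y = ["p2p", "p2p", "p2p", "p2p", "p2p", "web", "web", "web", "web"]

areas = [assign_area(knn_membership(X, y, i, k=3)) for i in range(len(X))]

# Only seeds outside the noisy area are used for oversampling here.
rng = random.Random(0)
synthetic = interpolate(X[0], X[3], rng)
```

In this toy setup, the isolated "p2p" point surrounded by "web" samples falls in the noisy area and would be skipped as an oversampling seed, while points in the compact cluster are safe. Excluding noisy seeds is one plausible way to avoid the noisy, overlapping synthetic samples that plain SMOTE tends to produce.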
About the journal:
ICT Express, published by the Korean Institute of Communications and Information Sciences (KICS), is an international, peer-reviewed research publication covering all aspects of information and communication technology. The journal aims to publish research that helps advance the theoretical and practical understanding of ICT convergence, platform technologies, communication networks, and device technologies. Technological advancement in the information and communication technology (ICT) sector enables portable devices to stay connected at all times while supporting high data rates, which has driven the recent popularity of smartphones and their considerable impact on economic and social development.