HSNF: Hybrid sampling with two-step noise filtering for imbalanced data classification

IF 0.9 4区 计算机科学 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Lilong Duan, Wei Xue, Xiaolei Gu, Xiao Luo, Yongsheng He
{"title":"HSNF: Hybrid sampling with two-step noise filtering for imbalanced data classification","authors":"Lilong Duan, Wei Xue, Xiaolei Gu, Xiao Luo, Yongsheng He","doi":"10.3233/ida-227111","DOIUrl":null,"url":null,"abstract":"Imbalanced data classification has received much attention in machine learning, and many oversampling methods exist to solve this problem. However, these methods may suffer from insufficient noise filtering, overlap between synthetic and original samples, etc., resulting in degradation of classification performance. To this end, we propose a hybrid sampling with two-step noise filtering (HSNF) method in this paper, which consists of three modules. In the first module, HSNF denoises twice according to different noise discrimination mechanisms. Note that denoising mechanism is essentially based on the Euclidean distance between samples. Then in the second module, the minority class samples are divided into two categories, boundary samples and safe samples, respectively, and a portion of the boundary majority class samples are removed. In the third module, different oversampling methods are used to synthesize instances for boundary minority class samples and safe minority class samples. Experimental results on synthetic data and benchmark datasets demonstrate the effectiveness of HSNF in comparison with several popular methods. The code of HSNF will be released.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":"5 1","pages":"0"},"PeriodicalIF":0.9000,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/ida-227111","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Imbalanced data classification has received much attention in machine learning, and many oversampling methods exist to solve this problem. However, these methods may suffer from insufficient noise filtering, overlap between synthetic and original samples, etc., resulting in degradation of classification performance. To this end, we propose a hybrid sampling with two-step noise filtering (HSNF) method in this paper, which consists of three modules. In the first module, HSNF denoises twice according to different noise discrimination mechanisms. Note that denoising mechanism is essentially based on the Euclidean distance between samples. Then in the second module, the minority class samples are divided into two categories, boundary samples and safe samples, respectively, and a portion of the boundary majority class samples are removed. In the third module, different oversampling methods are used to synthesize instances for boundary minority class samples and safe minority class samples. Experimental results on synthetic data and benchmark datasets demonstrate the effectiveness of HSNF in comparison with several popular methods. The code of HSNF will be released.
HSNF:用于不平衡数据分类的混合采样和两步噪声滤波
不平衡数据分类在机器学习中受到了广泛的关注,有许多过采样方法可以解决这一问题。然而,这些方法可能存在噪声滤波不足、合成样本与原始样本重叠等问题,导致分类性能下降。为此,本文提出了一种混合采样与两步噪声滤波(HSNF)方法,该方法由三个模块组成。在第一个模块中,HSNF根据不同的噪声识别机制进行两次降噪。注意去噪机制本质上是基于样本之间的欧几里得距离。然后在第二个模块中,将少数类样本分别分为边界样本和安全样本两类,并去除一部分边界多数类样本。在第三个模块中,使用不同的过采样方法合成边界少数类样本和安全少数类样本的实例。在综合数据和基准数据集上的实验结果表明,与几种常用方法相比,HSNF方法是有效的。HSNF的代码即将发布。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Intelligent Data Analysis
Intelligent Data Analysis 工程技术-计算机:人工智能
CiteScore
2.20
自引率
5.90%
发文量
85
审稿时长
3.3 months
期刊介绍: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信