{"title":"基于切比雪夫不等式的新型集合超采样方法,适用于不平衡多标签数据","authors":"Weishuo Ren , Yifeng Zheng , Wenjie Zhang , Depeng Qing , Xianlong Zeng , Guohe Li","doi":"10.1016/j.neucom.2024.128717","DOIUrl":null,"url":null,"abstract":"<div><div>With the development of intelligent technology, data exhibits characteristics of multi-label and imbalanced distribution, which lead to the degradation of classification model performance. Therefore, addressing multi-label class imbalance has become a hot research topic. Nowadays, over-sampling approaches aim to generate a superset of the original dataset to deal with imbalanced data. However, traditional over-sampling methods only employ the central data point and its nearest neighbor samples to synthesize samples without considering the impact of data distribution. To address these issues, in this paper, we propose an ensemble multi-label over-sampling algorithm (MLCIO) based on Chebyshev inequality and a group optimization strategy. Firstly, to generate more representative and diverse samples, with the seed sample serving as the sphere’s center, Chebyshev inequality is utilized to ensure that synthetic samples fall within its <span><math><mi>m</mi></math></span> times the standard deviation. Secondly, a group optimization ranking weighting approach is employed to obtain more reliable and stable label information. Finally, comparative experiments are conducted on 11 imbalanced datasets from various domains using different evaluation metrics. The results demonstrate that our proposal achieves better performance than other approaches.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A novel ensemble over-sampling approach based Chebyshev inequality for imbalanced multi-label data\",\"authors\":\"Weishuo Ren , Yifeng Zheng , Wenjie Zhang , Depeng Qing , Xianlong Zeng , Guohe Li\",\"doi\":\"10.1016/j.neucom.2024.128717\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>With the development of intelligent technology, data exhibits characteristics of multi-label and imbalanced distribution, which lead to the degradation of classification model performance. Therefore, addressing multi-label class imbalance has become a hot research topic. Nowadays, over-sampling approaches aim to generate a superset of the original dataset to deal with imbalanced data. However, traditional over-sampling methods only employ the central data point and its nearest neighbor samples to synthesize samples without considering the impact of data distribution. To address these issues, in this paper, we propose an ensemble multi-label over-sampling algorithm (MLCIO) based on Chebyshev inequality and a group optimization strategy. Firstly, to generate more representative and diverse samples, with the seed sample serving as the sphere’s center, Chebyshev inequality is utilized to ensure that synthetic samples fall within its <span><math><mi>m</mi></math></span> times the standard deviation. Secondly, a group optimization ranking weighting approach is employed to obtain more reliable and stable label information. Finally, comparative experiments are conducted on 11 imbalanced datasets from various domains using different evaluation metrics. The results demonstrate that our proposal achieves better performance than other approaches.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-10-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224014887\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224014887","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
摘要
随着智能技术的发展,数据呈现出多标签和分布不平衡的特点,导致分类模型性能下降。因此,解决多标签类不平衡问题已成为研究热点。目前,超采样方法旨在生成原始数据集的超集来处理不平衡数据。然而,传统的过度采样方法只利用中心数据点及其最近邻样本来合成样本,而没有考虑数据分布的影响。针对这些问题,本文提出了一种基于切比雪夫不等式和分组优化策略的集合多标签过采样算法(MLCIO)。首先,为了生成更具代表性和多样性的样本,以种子样本为球心,利用切比雪夫不等式确保合成样本落在其 m 倍标准偏差范围内。其次,采用分组优化排序加权法,以获得更可靠、更稳定的标签信息。最后,我们使用不同的评估指标在 11 个不同领域的不平衡数据集上进行了对比实验。结果表明,我们的建议比其他方法取得了更好的性能。
A novel ensemble over-sampling approach based Chebyshev inequality for imbalanced multi-label data
With the development of intelligent technology, data exhibits characteristics of multi-label and imbalanced distribution, which lead to the degradation of classification model performance. Therefore, addressing multi-label class imbalance has become a hot research topic. Nowadays, over-sampling approaches aim to generate a superset of the original dataset to deal with imbalanced data. However, traditional over-sampling methods only employ the central data point and its nearest neighbor samples to synthesize samples without considering the impact of data distribution. To address these issues, in this paper, we propose an ensemble multi-label over-sampling algorithm (MLCIO) based on Chebyshev inequality and a group optimization strategy. Firstly, to generate more representative and diverse samples, with the seed sample serving as the sphere’s center, Chebyshev inequality is utilized to ensure that synthetic samples fall within its times the standard deviation. Secondly, a group optimization ranking weighting approach is employed to obtain more reliable and stable label information. Finally, comparative experiments are conducted on 11 imbalanced datasets from various domains using different evaluation metrics. The results demonstrate that our proposal achieves better performance than other approaches.
期刊介绍:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.