基于稀疏类相关开发的长尾数据增强

IF 10.4 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2025-06-02 DOI:10.1109/TKDE.2025.3573899

Mengnan Qi;Shasha Mao;Yimeng Zhang;Jing Gu;Shuiping Gou;Licheng Jiao;Yuming Zhang

{"title":"基于稀疏类相关开发的长尾数据增强","authors":"Mengnan Qi;Shasha Mao;Yimeng Zhang;Jing Gu;Shuiping Gou;Licheng Jiao;Yuming Zhang","doi":"10.1109/TKDE.2025.3573899","DOIUrl":null,"url":null,"abstract":"The long-tailed data distribution frequently occurs in the real-world scenarios, whereas deep learning is not effective enough for such distribution. In order to improve the effectiveness for the long-tailed data, data augmentation is widely used to balance the distribution of classes by generating new samples. However, most existing studies are designed from the perspective of the class-independence assumption by default, ignoring the effect of interrelation among classes for data augmentation, which causes that some generated samples may be unrepresentative and useless for balancing the class-distribution. Inspired by this, we propose a new data augmentation method based the sparse class-correlation exploitation in this paper, which can generate more representative samples by utilizing the class-correlation, to effectively balance the class-distribution for the long-tailed data. In the proposed method, a sparse class-correlation exploration module is first proposed to explore the potential correlations among multiple classes for boosting the classification performance. Based on the class-correlations, the pivotal seed-samples are generated by maximizing the sparse representation of challenging samples. Meanwhile, an ambiguity-filtered translation module is designed to generate more representative new samples for the target classes based the obtained seed-samples by enhancing the class-consistency and suppressing the deviation from the target classes. In addition, we introduce the self-supervised feature and fuse it with the discriminative feature to explore more accurate class-correlations. Experimental results illustrate that the proposed method obtains better performance only with a small number of generated samples than the state-of-the-art methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4497-4511"},"PeriodicalIF":10.4000,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DASCE: Long-Tailed Data Augmentation Based Sparse Class-Correlation Exploitation\",\"authors\":\"Mengnan Qi;Shasha Mao;Yimeng Zhang;Jing Gu;Shuiping Gou;Licheng Jiao;Yuming Zhang\",\"doi\":\"10.1109/TKDE.2025.3573899\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The long-tailed data distribution frequently occurs in the real-world scenarios, whereas deep learning is not effective enough for such distribution. In order to improve the effectiveness for the long-tailed data, data augmentation is widely used to balance the distribution of classes by generating new samples. However, most existing studies are designed from the perspective of the class-independence assumption by default, ignoring the effect of interrelation among classes for data augmentation, which causes that some generated samples may be unrepresentative and useless for balancing the class-distribution. Inspired by this, we propose a new data augmentation method based the sparse class-correlation exploitation in this paper, which can generate more representative samples by utilizing the class-correlation, to effectively balance the class-distribution for the long-tailed data. In the proposed method, a sparse class-correlation exploration module is first proposed to explore the potential correlations among multiple classes for boosting the classification performance. Based on the class-correlations, the pivotal seed-samples are generated by maximizing the sparse representation of challenging samples. Meanwhile, an ambiguity-filtered translation module is designed to generate more representative new samples for the target classes based the obtained seed-samples by enhancing the class-consistency and suppressing the deviation from the target classes. In addition, we introduce the self-supervised feature and fuse it with the discriminative feature to explore more accurate class-correlations. Experimental results illustrate that the proposed method obtains better performance only with a small number of generated samples than the state-of-the-art methods.\",\"PeriodicalId\":13496,\"journal\":{\"name\":\"IEEE Transactions on Knowledge and Data Engineering\",\"volume\":\"37 8\",\"pages\":\"4497-4511\"},\"PeriodicalIF\":10.4000,\"publicationDate\":\"2025-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Knowledge and Data Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11021006/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Knowledge and Data Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11021006/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

长尾数据分布在现实场景中经常出现，而深度学习对于这种分布的效果不够。为了提高对长尾数据的有效性，数据增强被广泛用于通过生成新样本来平衡类的分布。然而，现有的研究大多默认从类独立假设的角度进行设计，忽略了类之间相互关系对数据增强的影响，这导致一些生成的样本可能不具有代表性，无法平衡类分布。受此启发，本文提出了一种基于稀疏类相关挖掘的数据增强方法，利用类相关生成更具代表性的样本，有效平衡长尾数据的类分布。在该方法中，首先提出了稀疏类相关性探索模块来探索多个类之间的潜在相关性，以提高分类性能。基于类相关性，通过最大化挑战性样本的稀疏表示来生成关键种子样本。同时，设计了歧义过滤翻译模块，通过增强类一致性和抑制与目标类的偏差，在获得种子样本的基础上为目标类生成更具代表性的新样本。此外，我们引入了自监督特征，并将其与判别特征融合，以探索更准确的类相关性。实验结果表明，该方法在生成少量样本的情况下就能获得比现有方法更好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

DASCE: Long-Tailed Data Augmentation Based Sparse Class-Correlation Exploitation

The long-tailed data distribution frequently occurs in the real-world scenarios, whereas deep learning is not effective enough for such distribution. In order to improve the effectiveness for the long-tailed data, data augmentation is widely used to balance the distribution of classes by generating new samples. However, most existing studies are designed from the perspective of the class-independence assumption by default, ignoring the effect of interrelation among classes for data augmentation, which causes that some generated samples may be unrepresentative and useless for balancing the class-distribution. Inspired by this, we propose a new data augmentation method based the sparse class-correlation exploitation in this paper, which can generate more representative samples by utilizing the class-correlation, to effectively balance the class-distribution for the long-tailed data. In the proposed method, a sparse class-correlation exploration module is first proposed to explore the potential correlations among multiple classes for boosting the classification performance. Based on the class-correlations, the pivotal seed-samples are generated by maximizing the sparse representation of challenging samples. Meanwhile, an ambiguity-filtered translation module is designed to generate more representative new samples for the target classes based the obtained seed-samples by enhancing the class-consistency and suppressing the deviation from the target classes. In addition, we introduce the self-supervised feature and fuse it with the discriminative feature to explore more accurate class-correlations. Experimental results illustrate that the proposed method obtains better performance only with a small number of generated samples than the state-of-the-art methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Knowledge and Data Engineering 工程技术-工程：电子与电气

CiteScore

11.70

自引率

3.40%

发文量

515

审稿时长

6 months

期刊介绍： The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.