DASCE: Long-Tailed Data Augmentation Based Sparse Class-Correlation Exploitation
Authors: Mengnan Qi; Shasha Mao; Yimeng Zhang; Jing Gu; Shuiping Gou; Licheng Jiao; Yuming Zhang
Journal: IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 8, pp. 4497-4511 (published 2025-06-02)
DOI: 10.1109/TKDE.2025.3573899
URL: https://ieeexplore.ieee.org/document/11021006/
Impact factor: 8.9; JCR: Q1 (Computer Science, Artificial Intelligence)
Long-tailed data distributions occur frequently in real-world scenarios, yet deep learning is not sufficiently effective on them. To improve performance on long-tailed data, data augmentation is widely used to balance the class distribution by generating new samples. However, most existing studies implicitly assume class independence and ignore the effect of inter-class relationships on data augmentation, so some generated samples may be unrepresentative and useless for balancing the class distribution. Motivated by this, we propose a new data augmentation method based on sparse class-correlation exploitation, which generates more representative samples by utilizing class correlations and thereby effectively balances the class distribution of long-tailed data. In the proposed method, a sparse class-correlation exploration module first explores the potential correlations among multiple classes to boost classification performance. Based on the class correlations, pivotal seed samples are generated by maximizing the sparse representation of challenging samples. Meanwhile, an ambiguity-filtered translation module is designed to generate more representative new samples for the target classes from the obtained seed samples by enhancing class consistency and suppressing deviation from the target classes. In addition, we introduce a self-supervised feature and fuse it with the discriminative feature to explore more accurate class correlations. Experimental results show that the proposed method outperforms state-of-the-art methods while using only a small number of generated samples.
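The abstract does not give the formulation of the sparse class-correlation module, so the following is only a generic illustration of what a sparse representation over classes can look like, not the DASCE method itself. It assumes each class is summarized by a prototype feature vector and codes a sample as an L1-regularized combination of those prototypes using ISTA (iterative soft-thresholding). The names `sparse_class_code`, `prototypes`, `lam`, and `n_iter` are hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the L1 norm)."""
    magnitude = max(abs(value) - threshold, 0.0)
    return math.copysign(magnitude, value)


def sparse_class_code(sample, prototypes, lam=0.1, n_iter=500):
    """Approximately minimize 0.5*||sample - sum_k a[k]*prototypes[k]||^2 + lam*||a||_1.

    prototypes: list of K class-prototype feature vectors (an assumption of this sketch).
    Returns K coefficients; non-zero entries mark the classes the sample relates to.
    """
    num_classes = len(prototypes)
    dim = len(sample)
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the
    # gradient, so 1/lipschitz is a safe (if conservative) step size.
    lipschitz = sum(v * v for proto in prototypes for v in proto)
    step = 1.0 / lipschitz
    coeffs = [0.0] * num_classes
    for _ in range(n_iter):
        # Residual of the current reconstruction: (sum_k a[k] * prototype_k) - sample.
        residual = [
            sum(coeffs[k] * prototypes[k][j] for k in range(num_classes)) - sample[j]
            for j in range(dim)
        ]
        # Gradient step on the quadratic term, then soft-thresholding for the L1 term.
        coeffs = [
            soft_threshold(
                coeffs[k] - step * sum(prototypes[k][j] * residual[j] for j in range(dim)),
                step * lam,
            )
            for k in range(num_classes)
        ]
    return coeffs


# Toy example: three orthonormal class prototypes and one ambiguous sample.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sample = [1.0, 0.5, 0.05]
coeffs = sparse_class_code(sample, prototypes, lam=0.1)
related = [k for k, c in enumerate(coeffs) if abs(c) > 1e-8]
print(coeffs, related)
```

With orthonormal prototypes the L1-regularized solution is simply the soft-thresholded inner products, so the sample keeps non-zero weights on classes 0 and 1 while its weak link to class 2 is zeroed out. In this toy setting, the surviving non-zero weights play the role of a sparse set of related classes; how the paper actually derives and uses such correlations is not specified in the abstract.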
About the journal:
The IEEE Transactions on Knowledge and Data Engineering covers knowledge and data engineering aspects of computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.