Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts

IF 5.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neurocomputing Pub Date : 2024-10-28 DOI:10.1016/j.neucom.2024.128773

Zehua Hao, Fang Liu, Licheng Jiao, Yaoyang Du, Shuo Li, Hao Wang, Pengfang Li, Xu Liu, Puhua Chen

{"title":"Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts","authors":"Zehua Hao, Fang Liu, Licheng Jiao, Yaoyang Du, Shuo Li, Hao Wang, Pengfang Li, Xu Liu, Puhua Chen","doi":"10.1016/j.neucom.2024.128773","DOIUrl":null,"url":null,"abstract":"<div><div>In the current landscape of Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP, the predominant approach is based on prompt learning paradigms. These methods encounter significant computational complexity when dealing with a large number of categories. Additionally, when confronted with new classification tasks, there is a necessity to learn the prompts again, which can be both time-consuming and resource-intensive. To address these challenges, We present a new methodology, named the <strong>M</strong>ixture of <strong>P</strong>retrained <strong>E</strong>xpert (MoPE), for enhancing Compositional Zero-shot Learning through Logit-Level Fusion with Multi Expert Fusion Module. The MoPE skillfully blends the benefits of extensive pre-trained models like CLIP, Bert, GPT-3 and Word2Vec for effectively tackling Compositional Zero-shot Learning. Firstly, we extract the text label space for each language model individually, then map the visual feature vectors to their respective text spaces. This maintains the integrity and structure of the original text space. During this process, the pre-trained expert parameters are kept frozen. The mapping of visual features to the corresponding text spaces is subject to learning and could be considered as multiple learnable visual experts. In the model fusion phase, we propose a new fusion strategy that features a gating mechanism that adjusts the contributions of various models dynamically. This enables our approach to adapt more effectively to a range of tasks and data sets. The method’s robustness is demonstrated by the fact that the language model is not tailored to specific downstream task datasets or losses. This preserves the larger model’s topology and expands the potential for application. Preliminary experiments conducted on the UT-Zappos, AO-Clever, and C-GQA datasets indicate that MoPE performs competitively when compared to existing techniques.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128773"},"PeriodicalIF":5.5000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224015443","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

In the current landscape of Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP, the predominant approach is based on prompt learning paradigms. These methods encounter significant computational complexity when dealing with a large number of categories. Additionally, when confronted with new classification tasks, there is a necessity to learn the prompts again, which can be both time-consuming and resource-intensive. To address these challenges, We present a new methodology, named the Mixture of Pretrained Expert (MoPE), for enhancing Compositional Zero-shot Learning through Logit-Level Fusion with Multi Expert Fusion Module. The MoPE skillfully blends the benefits of extensive pre-trained models like CLIP, Bert, GPT-3 and Word2Vec for effectively tackling Compositional Zero-shot Learning. Firstly, we extract the text label space for each language model individually, then map the visual feature vectors to their respective text spaces. This maintains the integrity and structure of the original text space. During this process, the pre-trained expert parameters are kept frozen. The mapping of visual features to the corresponding text spaces is subject to learning and could be considered as multiple learnable visual experts. In the model fusion phase, we propose a new fusion strategy that features a gating mechanism that adjusts the contributions of various models dynamically. This enables our approach to adapt more effectively to a range of tasks and data sets. The method’s robustness is demonstrated by the fact that the language model is not tailored to specific downstream task datasets or losses. This preserves the larger model’s topology and expands the potential for application. Preliminary experiments conducted on the UT-Zappos, AO-Clever, and C-GQA datasets indicate that MoPE performs competitively when compared to existing techniques.

查看原文本刊更多论文

通过混合预训练专家，保持文本空间的完整性，实现稳健的合成零点学习

在当前利用 CLIP 的合成零点学习（CZSL）方法中，最主要的方法是基于提示学习范式。这些方法在处理大量类别时会遇到很大的计算复杂性。此外，在面对新的分类任务时，有必要再次学习提示，这可能既耗时又耗资源。为了应对这些挑战，我们提出了一种名为 "预训练专家混合物"（MoPE）的新方法，通过与多专家融合模块的 Logit 级融合来增强合成零点学习。MoPE 巧妙地融合了大量预训练模型（如 CLIP、Bert、GPT-3 和 Word2Vec）的优点，从而有效地解决了合成零镜头学习问题。首先，我们为每个语言模型单独提取文本标签空间，然后将视觉特征向量映射到各自的文本空间。这就保持了原始文本空间的完整性和结构性。在此过程中，预先训练好的专家参数将被冻结。视觉特征到相应文本空间的映射是可以学习的，可以视为多个可学习的视觉专家。在模型融合阶段，我们提出了一种新的融合策略，其特点是采用门控机制，动态调整各种模型的贡献。这使我们的方法能够更有效地适应一系列任务和数据集。该方法的稳健性体现在语言模型并不针对特定的下游任务数据集或损失而量身定制。这就保留了更大模型的拓扑结构，并扩大了应用潜力。在UT-Zappos、AO-Clever 和 C-GQA 数据集上进行的初步实验表明，与现有技术相比，MoPE 的性能极具竞争力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Neurocomputing 工程技术-计算机：人工智能

CiteScore

13.10

自引率

10.00%

发文量

1382

审稿时长

70 days

期刊介绍： Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.