Turbo your multi-modal classification with contrastive learning
Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li
arXiv - CS - Multimedia, arXiv:2409.09282, published 2024-09-14
Citations: 0
Abstract
Contrastive learning has become one of the most impressive approaches for
multi-modal representation learning. However, previous multi-modal works mainly
focused on cross-modal understanding, ignoring in-modal contrastive learning,
which limits the representation of each modality. In this paper, we propose a
novel contrastive learning strategy, called Turbo, to promote multi-modal
understanding by joint in-modal and cross-modal contrastive learning.
Specifically, multi-modal data pairs are sent through the forward pass twice
with different hidden dropout masks to get two different representations for
each modality. With these representations, we obtain multiple in-modal and
cross-modal contrastive objectives for training. Finally, we combine the
self-supervised Turbo objective with supervised multi-modal classification and
demonstrate its effectiveness on two audio-text classification tasks,
achieving state-of-the-art performance on a speech emotion recognition
benchmark dataset.
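The core idea described above — embedding each pair twice under different dropout masks, then contrasting both within and across modalities — can be sketched as follows. This is a minimal NumPy illustration of the joint objective, not the authors' implementation: the `info_nce` and `turbo_loss` function names, the temperature value, and the crude dropout stand-in are all assumptions for the sake of the example.

```python
import numpy as np

def info_nce(q, k, tau=0.07):
    """InfoNCE loss: row i of q should match row i of k (positives on the diagonal)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def turbo_loss(a1, a2, t1, t2, tau=0.07):
    """Joint in-modal + cross-modal contrastive objective (hypothetical sketch).

    a1, a2: audio embeddings from two forward passes with different dropout masks
    t1, t2: text embeddings from two forward passes with different dropout masks
    """
    # In-modal terms: the two dropout views of the same item are positives.
    in_modal = info_nce(a1, a2, tau) + info_nce(t1, t2, tau)
    # Cross-modal terms: paired audio/text embeddings are positives.
    cross_modal = (info_nce(a1, t1, tau) + info_nce(a2, t2, tau)
                   + info_nce(a1, t2, tau) + info_nce(a2, t1, tau))
    return in_modal + cross_modal

# Toy batch: random "encoder outputs" with a crude stand-in for hidden dropout.
rng = np.random.default_rng(0)
base_a = rng.normal(size=(8, 32))
base_t = rng.normal(size=(8, 32))
drop = lambda x: x * (rng.random(x.shape) > 0.1)  # different mask on each call
loss = turbo_loss(drop(base_a), drop(base_a), drop(base_t), drop(base_t))
```

In a real model the two views would come from two forward passes through the same encoder with dropout enabled (as in SimCSE-style training), and this self-supervised loss would be added to the supervised classification loss.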