Turbo your multi-modal classification with contrastive learning
Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li
arXiv - CS - Multimedia, arXiv:2409.09282, published 2024-09-14
Citations: 0
Abstract
Contrastive learning has become one of the most impressive approaches for
multi-modal representation learning. However, previous multi-modal works mainly
focused on cross-modal understanding, ignoring in-modal contrastive learning,
which limits the representation of each modality. In this paper, we propose a
novel contrastive learning strategy, called Turbo, to promote multi-modal
understanding by joint in-modal and cross-modal contrastive learning.
Specifically, multi-modal data pairs are sent through the forward pass twice
with different hidden dropout masks to get two different representations for
each modality. With these representations, we obtain multiple in-modal and
cross-modal contrastive objectives for training. Finally, we combine the
self-supervised Turbo objective with supervised multi-modal classification and
demonstrate its effectiveness on two audio-text classification tasks,
achieving state-of-the-art performance on a speech emotion recognition
benchmark dataset.
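The core idea described above — embedding each pair twice under different dropout masks, then contrasting both within and across modalities — can be sketched as follows. This is a minimal NumPy illustration of the joint objective, not the authors' implementation: the `info_nce` and `turbo_loss` function names, the temperature value, and the crude dropout stand-in are all assumptions for the sake of the example.

```python
import numpy as np

def info_nce(q, k, tau=0.07):
    """InfoNCE loss: row i of q should match row i of k (positives on the diagonal)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def turbo_loss(a1, a2, t1, t2, tau=0.07):
    """Joint in-modal + cross-modal contrastive objective (hypothetical sketch).

    a1, a2: audio embeddings from two forward passes with different dropout masks
    t1, t2: text embeddings from two forward passes with different dropout masks
    """
    # In-modal terms: the two dropout views of the same item are positives.
    in_modal = info_nce(a1, a2, tau) + info_nce(t1, t2, tau)
    # Cross-modal terms: paired audio/text embeddings are positives.
    cross_modal = (info_nce(a1, t1, tau) + info_nce(a2, t2, tau)
                   + info_nce(a1, t2, tau) + info_nce(a2, t1, tau))
    return in_modal + cross_modal

# Toy batch: random "encoder outputs" with a crude stand-in for hidden dropout.
rng = np.random.default_rng(0)
base_a = rng.normal(size=(8, 32))
base_t = rng.normal(size=(8, 32))
drop = lambda x: x * (rng.random(x.shape) > 0.1)  # different mask on each call
loss = turbo_loss(drop(base_a), drop(base_a), drop(base_t), drop(base_t))
```

In a real model the two views would come from two forward passes through the same encoder with dropout enabled (as in SimCSE-style training), and this self-supervised loss would be added to the supervised classification loss.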