{"title":"多模态表征学习中的放松对比","authors":"Zudi Lin, Erhan Bas, Kunwar Yashraj Singh, Gurumurthy Swaminathan, Rahul Bhotika","doi":"10.1109/WACV56688.2023.00226","DOIUrl":null,"url":null,"abstract":"Multimodal representation learning for images with paired raw texts can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which act as a drop-in replacement of the widely used InfoNCE loss. The key insight of ReCo is to allow a relaxed negative space by not penalizing unpaired multimodal samples (i.e., negative pairs) that are already orthogonal or negatively correlated. Unlike the widely-used InfoNCE, which keeps repelling negative pairs as long as they are not anti-correlated, ReCo by design embraces more diversity and flexibility of the learned embeddings. We conduct exten-sive experiments using ReCo with state-of-the-art models by pretraining on the MIMIC-CXR dataset that consists of chest radiographs and free-text radiology reports, and eval-uating on the CheXpert dataset for multimodal retrieval and disease classification. Our ReCo achieves an absolute improvement of 2.9% over the InfoNCE baseline on the CheXpert Retrieval dataset in average retrieval precision and re-ports better or comparable performance in the linear evaluation and finetuning for classification. 
We further show that ReCo outperforms InfoNCE on the Flickr30K dataset by 1.7% in retrieval Recall@1, demonstrating the generalizability of our approach to natural images.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"262 1-2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Relaxing Contrastiveness in Multimodal Representation Learning\",\"authors\":\"Zudi Lin, Erhan Bas, Kunwar Yashraj Singh, Gurumurthy Swaminathan, Rahul Bhotika\",\"doi\":\"10.1109/WACV56688.2023.00226\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal representation learning for images with paired raw texts can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which act as a drop-in replacement of the widely used InfoNCE loss. The key insight of ReCo is to allow a relaxed negative space by not penalizing unpaired multimodal samples (i.e., negative pairs) that are already orthogonal or negatively correlated. Unlike the widely-used InfoNCE, which keeps repelling negative pairs as long as they are not anti-correlated, ReCo by design embraces more diversity and flexibility of the learned embeddings. We conduct exten-sive experiments using ReCo with state-of-the-art models by pretraining on the MIMIC-CXR dataset that consists of chest radiographs and free-text radiology reports, and eval-uating on the CheXpert dataset for multimodal retrieval and disease classification. 
Our ReCo achieves an absolute improvement of 2.9% over the InfoNCE baseline on the CheXpert Retrieval dataset in average retrieval precision and re-ports better or comparable performance in the linear evaluation and finetuning for classification. We further show that ReCo outperforms InfoNCE on the Flickr30K dataset by 1.7% in retrieval Recall@1, demonstrating the generalizability of our approach to natural images.\",\"PeriodicalId\":270631,\"journal\":{\"name\":\"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)\",\"volume\":\"262 1-2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WACV56688.2023.00226\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WACV56688.2023.00226","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Relaxing Contrastiveness in Multimodal Representation Learning
Multimodal representation learning for images with paired raw texts can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which acts as a drop-in replacement for the widely used InfoNCE loss. The key insight of ReCo is to allow a relaxed negative space by not penalizing unpaired multimodal samples (i.e., negative pairs) that are already orthogonal or negatively correlated. Unlike the widely used InfoNCE, which keeps repelling negative pairs as long as they are not anti-correlated, ReCo by design embraces more diversity and flexibility in the learned embeddings. We conduct extensive experiments using ReCo with state-of-the-art models, pretraining on the MIMIC-CXR dataset, which consists of chest radiographs and free-text radiology reports, and evaluating on the CheXpert dataset for multimodal retrieval and disease classification. ReCo achieves an absolute improvement of 2.9% in average retrieval precision over the InfoNCE baseline on CheXpert retrieval, and reports better or comparable performance in linear evaluation and finetuning for classification. We further show that ReCo outperforms InfoNCE on the Flickr30K dataset by 1.7% in retrieval Recall@1, demonstrating the generalizability of our approach to natural images.
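The distinction the abstract draws can be sketched in code: InfoNCE's softmax denominator keeps pushing every negative pair apart until it is anti-correlated, whereas the relaxed objective stops penalizing negatives whose similarity is already at or below zero. The following NumPy sketch reflects only our reading of the abstract; the paper's actual formulation (temperature, masking, symmetrization across modalities) may differ, and `info_nce` / `reco` are illustrative names, not the authors' code.

```python
import numpy as np

def info_nce(sim, temperature=0.1):
    """Standard InfoNCE over a similarity matrix whose diagonal
    entries are the positive (paired) image-text samples."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    # cross-entropy against the diagonal: all off-diagonal (negative)
    # pairs stay in the denominator, so they are repelled even when
    # their similarity is already negative
    return -np.log(np.diag(p)).mean()

def reco(sim, temperature=0.1):
    """Relaxed contrastive sketch: negative pairs that are already
    orthogonal or negatively correlated (sim <= 0) are dropped from
    the denominator, so they receive no further repulsive gradient."""
    n = sim.shape[0]
    logits = sim / temperature
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    relaxed = (sim <= 0) & ~np.eye(n, dtype=bool)  # already-separated negatives
    exp = np.where(relaxed, 0.0, exp)              # mask them out
    return -np.log(np.diag(exp) / exp.sum(axis=1)).mean()
```

On a toy batch where the two negatives are already anti-aligned (`sim = [[1.0, -0.5], [-0.5, 1.0]]`), `info_nce` still returns a positive loss that would push them further apart, while `reco` returns zero, illustrating the "relaxed negative space" the abstract describes.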