Multimodal Variational Autoencoder: A Barycentric View.

Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence Pub Date : 2025-01-01 Epub Date: 2025-04-11 DOI:10.1609/aaai.v39i19.34209

Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Jin Yang, Xiaotong Sun, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras

{"title":"Multimodal Variational Autoencoder: A Barycentric View.","authors":"Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Jin Yang, Xiaotong Sun, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras","doi":"10.1609/aaai.v39i19.34209","DOIUrl":null,"url":null,"abstract":"<p><p>Multiple signal modalities, such as vision and sounds, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoder (VAE), for multimodal representation learning especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.</p>","PeriodicalId":74506,"journal":{"name":"Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","volume":"39 19","pages":"20060-20068"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12360785/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1609/aaai.v39i19.34209","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/11 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Multiple signal modalities, such as vision and sounds, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoder (VAE), for multimodal representation learning especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.

查看原文本刊更多论文

多模态变分自编码器：以重心为中心的视图。

多种信号模式，如视觉和声音，自然存在于现实世界的现象中。最近，人们对学习生成模型越来越感兴趣，特别是变分自编码器（VAE），用于多模态表示学习，特别是在缺少模态的情况下。这些模型的主要目标是学习一种模态不变的和特定于模态的表示，这种表示表征了跨多个模态的信息。以前对多模态VAEs的尝试主要是通过专家的视角来解决这个问题，用专家的产品（PoE）、专家的混合物（MoE）或两者的组合来聚合单模态推理分布。本文通过质心透镜给出了多模态VAE的另一种通用的理论表述。我们首先证明PoE和MoE是质心的特定实例，通过最小化非对称加权KL散度到单峰推理分布而得到。我们的新配方通过考虑不同类型的分歧，将这两个重心扩展为更灵活的选择。特别是，我们探索了由2-Wasserstein距离定义的Wasserstein质心，与KL散度相比，它通过捕获模态特定和模态不变表示更好地保留了单峰分布的几何形状。对三个多模态基准的实证研究证明了该方法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence

自引率

0.00%

发文量