A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning

Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier
{"title":"A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning","authors":"Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, R. Séguier","doi":"10.1145/3552487.3556435","DOIUrl":null,"url":null,"abstract":"High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a smaller dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE) [1, 2], which is equipped with both a generative and inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended in many ways, including for dealing with data that are either multimodal [3] or dynamical (i.e., sequential) [4]. In this talk, we will present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities (e.g., the speaker's lip movements) from those that are specific to each modality (e.g., the speaker's pitch variation or eye movements). A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence (e.g., the speaker's identity or global emotional state). The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two steps. In the first step, a vector quantized VAE (VQ-VAE) [5] is learned independently for each modality, without temporal modeling. The second step consists in learning the MDVAE, whose inputs are the intermediate representations of the VQ-VAE before quantization. The disentanglement between static versus dynamical and modality-specific versus shared information occurs during this second training stage. Experimental results will be presented, featuring what characteristics of the audiovisual speech data are encoded within the different latent spaces, how the proposed multimodal model can be beneficial compared with a unimodal one, and how the learned representation can be leveraged to perform downstream tasks.","PeriodicalId":274055,"journal":{"name":"Proceedings of the 1st International Workshop on Methodologies for Multimedia","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International Workshop on Methodologies for Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3552487.3556435","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently of one another. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data are generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE) [1, 2], which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended in many ways, including to deal with data that are multimodal [3] or dynamical (i.e., sequential) [4]. In this talk, we will present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities (e.g., the speaker's lip movements) from those that are specific to each modality (e.g., the speaker's pitch variation or eye movements). A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence (e.g., the speaker's identity or global emotional state). The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two steps. In the first step, a vector-quantized VAE (VQ-VAE) [5] is learned independently for each modality, without temporal modeling. The second step consists of learning the MDVAE, whose inputs are the intermediate representations of the VQ-VAE before quantization. The disentanglement of static versus dynamical and of modality-specific versus shared information emerges during this second training stage. Experimental results will be presented, showing which characteristics of the audiovisual speech data are encoded within the different latent spaces, how the proposed multimodal model can be beneficial compared with a unimodal one, and how the learned representation can be leveraged to perform downstream tasks.
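To make the latent structure concrete, the sketch below (in PyTorch) shows one possible way to organize the three groups of latent variables described in the abstract: a static variable w encoded from the whole sequence, a dynamical variable z_av shared by the audio and visual streams, and modality-specific dynamical variables z_a and z_v, with each modality reconstructed from the static, shared, and specific latents. The layer choices (GRU and linear blocks), dimensions, and exact dependency structure are illustrative assumptions for this abstract, not the authors' architecture.

```python
# Minimal sketch of the MDVAE-style latent factorization described above.
# All module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


def reparameterize(mean, logvar):
    """Standard VAE reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mean + std * torch.randn_like(std)


class MDVAESketch(nn.Module):
    def __init__(self, dim_a=64, dim_v=64, dim_w=16, dim_zav=8, dim_za=8, dim_zv=8):
        super().__init__()
        dim_in = dim_a + dim_v
        # Static latent w: one vector per sequence (identity, global emotion).
        self.enc_w = nn.Linear(dim_in, 2 * dim_w)
        # Shared dynamical latent z_av: one vector per time step, both modalities.
        self.enc_zav = nn.GRU(dim_in + dim_w, 2 * dim_zav, batch_first=True)
        # Modality-specific dynamical latents z_a and z_v.
        self.enc_za = nn.GRU(dim_a + dim_w + dim_zav, 2 * dim_za, batch_first=True)
        self.enc_zv = nn.GRU(dim_v + dim_w + dim_zav, 2 * dim_zv, batch_first=True)
        # Decoders reconstruct per-modality features from static + shared + specific latents.
        self.dec_a = nn.Linear(dim_w + dim_zav + dim_za, dim_a)
        self.dec_v = nn.Linear(dim_w + dim_zav + dim_zv, dim_v)

    def forward(self, feat_a, feat_v):
        # feat_a: (batch, time, dim_a) audio features; feat_v: (batch, time, dim_v) visual features
        x = torch.cat([feat_a, feat_v], dim=-1)
        T = x.size(1)

        # Static latent: summarizes the whole sequence (mean-pooled here).
        w_mean, w_logvar = self.enc_w(x.mean(dim=1)).chunk(2, dim=-1)
        w = reparameterize(w_mean, w_logvar)
        w_seq = w.unsqueeze(1).expand(-1, T, -1)

        # Shared audiovisual dynamics (e.g., lip movements).
        zav_params, _ = self.enc_zav(torch.cat([x, w_seq], dim=-1))
        zav = reparameterize(*zav_params.chunk(2, dim=-1))

        # Audio-specific dynamics (e.g., pitch variation).
        za_params, _ = self.enc_za(torch.cat([feat_a, w_seq, zav], dim=-1))
        za = reparameterize(*za_params.chunk(2, dim=-1))

        # Visual-specific dynamics (e.g., eye movements).
        zv_params, _ = self.enc_zv(torch.cat([feat_v, w_seq, zav], dim=-1))
        zv = reparameterize(*zv_params.chunk(2, dim=-1))

        # Reconstruct each modality from its relevant latent groups.
        rec_a = self.dec_a(torch.cat([w_seq, zav, za], dim=-1))
        rec_v = self.dec_v(torch.cat([w_seq, zav, zv], dim=-1))
        return rec_a, rec_v
```

In the second training stage described above, the inputs feat_a and feat_v would be the pre-quantization features produced by the per-modality VQ-VAE encoders, and the training objective would combine reconstruction terms on those features with KL regularization terms for each group of latent variables.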