Multimodal Cascaded Framework with Metric Learning Robust to Missing Modalities for Person Classification

Vijay John, Yasutomo Kawanishi

Proceedings of the 14th Conference on ACM Multimedia Systems, published 2023-06-07. DOI: 10.1145/3587819.3590989
Citation count: 0
Abstract
This paper addresses the missing modality problem in multimodal person classification, where an incomplete multimodal input with one modality missing is classified into predefined person classes. A multimodal cascaded framework with three deep learning models is proposed, in which the model parameters, outputs, and latent spaces learnt at a given step are transferred to the model in the subsequent step. The cascaded framework addresses the missing modality problem by first generating the complete multimodal data from the incomplete multimodal data in the feature space via a latent space. Subsequently, the generated and original multimodal features are merged and embedded into a final latent space to estimate the person label. During the learning phase, the cascaded framework uses two novel latent loss functions, the missing modality joint loss and the latent prior loss, to learn the different latent spaces. The missing modality joint loss ensures that latent data of the same class remain close to each other even when a modality is missing. The latent prior loss learns the final latent space using a previously learnt latent space as a prior. The proposed framework is validated on the audio-visible RAVDESS and the visible-thermal Speaking Faces datasets. A detailed comparative analysis and an ablation analysis demonstrate that the proposed framework enhances the robustness of person classification under missing-modality conditions, reporting average increases of 21.75% and 25.73% over the baseline algorithms on the RAVDESS and Speaking Faces datasets, respectively.
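The abstract describes the two latent losses only at a high level. As an illustration only, the following is a minimal sketch of how losses with these stated goals might look: a contrastive-style joint loss that pulls same-class latent vectors together across complete and missing-modality inputs, and a mean-squared-error prior loss that regularizes the final latent space toward a previously learnt one. The function names, the Euclidean distance, the margin formulation, and the MSE choice are all assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def missing_modality_joint_loss(z_complete, z_missing, labels, margin=1.0):
    """Hypothetical sketch of a missing modality joint loss.

    Pulls latent vectors of the same class together across the
    complete-input and missing-modality embeddings, and pushes
    different-class pairs apart by at least `margin`
    (a contrastive-style formulation, assumed for illustration).
    """
    loss, n = 0.0, len(labels)
    for i in range(n):
        for j in range(n):
            d = np.linalg.norm(z_complete[i] - z_missing[j])
            if labels[i] == labels[j]:
                loss += d ** 2                      # same class: minimize distance
            else:
                loss += max(0.0, margin - d) ** 2   # different class: enforce margin
    return loss / (n * n)

def latent_prior_loss(z_final, z_prior):
    """Hypothetical sketch of a latent prior loss: regularize the final
    latent space toward the previously learnt latent space used as a
    prior (mean squared error, assumed for illustration)."""
    return float(np.mean((z_final - z_prior) ** 2))
```

Under this sketch, the joint loss is zero when same-class latents coincide and different-class latents are separated by at least the margin, and the prior loss is zero when the final latent space matches the prior exactly.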