Multimodal Cascaded Framework with Metric Learning Robust to Missing Modalities for Person Classification

Vijay John, Yasutomo Kawanishi
DOI: 10.1145/3587819.3590989 (https://doi.org/10.1145/3587819.3590989)
Published in: Proceedings of the 14th Conference on ACM Multimedia Systems, 2023-06-07

Abstract

This paper addresses the missing modality problem in multimodal person classification, where an incomplete multimodal input with one modality missing is classified into predefined person classes. A multimodal cascaded framework with three deep learning models is proposed, where the model parameters, outputs, and latent space learnt at a given step are transferred to the model in the subsequent step. The cascaded framework addresses the missing modality problem by first generating the complete multimodal data from the incomplete multimodal data in the feature space via a latent space. Subsequently, the generated and original multimodal features are merged and embedded into a final latent space to estimate the person label. During the learning phase, the cascaded framework uses two novel latent loss functions, the missing modality joint loss and the latent prior loss, to learn the different latent spaces. The missing modality joint loss ensures that latent data from the same class remain close to each other even if a modality is missing. The latent prior loss learns the final latent space using a previously learnt latent space as a prior. The proposed framework is validated on the audio-visible RAVDESS and the visible-thermal Speaking Faces datasets. A detailed comparative analysis and an ablation analysis demonstrate that the proposed framework enhances the robustness of person classification under missing-modality conditions, yielding average improvements of 21.75% and 25.73% over the baseline algorithms on the RAVDESS and Speaking Faces datasets, respectively.
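The abstract does not give the two latent losses in mathematical form, but their stated behaviour admits a straightforward reading. The following is a minimal PyTorch sketch under assumptions, not the authors' implementation: the margin-based contrastive form of the missing modality joint loss, the MSE form of the latent prior loss, the `margin` value, and all function and variable names are illustrative choices.

```python
import torch
import torch.nn.functional as F

def missing_modality_joint_loss(z_complete, z_missing, labels, margin=1.0):
    """Hypothetical contrastive reading of the missing modality joint loss:
    pair each complete-input latent with a modality-dropped latent, pull
    same-class pairs together and push different-class pairs apart.
    Assumes the batch contains both same-class and different-class pairs."""
    d = torch.cdist(z_complete, z_missing)             # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) same-class mask
    pos = d[same].pow(2).mean()                        # same class: minimise distance
    neg = F.relu(margin - d[~same]).pow(2).mean()      # different class: keep a margin
    return pos + neg

def latent_prior_loss(z_final, z_prior):
    """Hypothetical reading of the latent prior loss: pull the final latent
    space towards a previously learnt latent space, treated as a fixed prior."""
    return F.mse_loss(z_final, z_prior.detach())

# Toy usage: a batch of 8 samples with 32-dimensional latents and 4 classes.
z_complete, z_missing = torch.randn(8, 32), torch.randn(8, 32)
labels = torch.randint(0, 4, (8,))
z_final, z_prior = torch.randn(8, 32), torch.randn(8, 32)
loss = missing_modality_joint_loss(z_complete, z_missing, labels) \
       + latent_prior_loss(z_final, z_prior)
```

Pairing complete and modality-dropped embeddings is one natural way to realise "same-class latent data stay close even if a modality is missing"; the paper's actual formulation may differ.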