SAMR: Symmetric masked multimodal modeling for general multi-modal 3D motion retrieval

IF 3.7 · Zone 2 (Engineering & Technology) · JCR Q1 · COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Displays · Pub Date: 2025-02-07 · DOI: 10.1016/j.displa.2025.102987

Yunhao Li, Sijing Wu, Yucheng Zhu, Wei Sun, Zhichao Zhang, Song Song, Guangtao Zhai

Displays, Volume 87, Article 102987 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0141938225000241

Citations: 0

Abstract


SAMR: Symmetric masked multimodal modeling for general multi-modal 3D motion retrieval

Text-to-3D human motion retrieval has recently become an active topic in computer vision. Existing methods rely on contrastive learning and motion reconstruction as their main proxy tasks. Although these methods perform well, such simple strategies may cause the network to lose temporal motion information and distort the text features, which may harm retrieval results. Existing motion retrieval methods also ignore post-processing of the predicted similarity matrices. To address these two problems, we present SAMR, an encoder–decoder transformer framework with symmetric masked multi-modal information modeling. Concretely, we remove the KL divergence loss and reconstruct the motion and text inputs jointly. To make the retrieval model more robust, we also propose a mask modeling strategy: SAMR performs joint masking on both image and text inputs, and during training, for each modality, it simultaneously reconstructs the original input modality and the masked modality to stabilize training. After training, we apply dual softmax optimization to improve final performance. We conduct extensive experiments on both text-to-motion and speech-to-motion datasets. The results show that SAMR achieves state-of-the-art performance on various cross-modal motion retrieval tasks, including speech-to-motion and text-to-motion, and has strong potential to serve as a general foundation framework for motion retrieval.
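The abstract mentions post-processing the predicted similarity matrix with dual softmax but gives no formula. The sketch below illustrates one commonly used formulation, in which each similarity score is re-weighted by a softmax prior taken over all queries for the same candidate. The formulation, the temperature value, the toy matrix, and the helper names (`softmax`, `dual_softmax`, `top1`) are all assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dual_softmax(sim, temperature=100.0):
    """Re-weight a query-by-candidate similarity matrix.

    Assumed formulation (not taken from the abstract): every entry is
    multiplied by a softmax prior computed along its column, i.e. over
    all queries competing for the same candidate.
    """
    n_q, n_c = len(sim), len(sim[0])
    # One prior per column: how strongly each candidate prefers each query.
    prior_cols = [
        softmax([temperature * sim[i][j] for i in range(n_q)])
        for j in range(n_c)
    ]
    return [[sim[i][j] * prior_cols[j][i] for j in range(n_c)]
            for i in range(n_q)]

def top1(sim):
    """Index of the best-scoring candidate for every query."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

# Toy matrix: rows are text queries, columns are motions.
# Motion 0 acts as a "hub" that scores highly for every query.
sim = [
    [0.90, 0.30, 0.20],
    [0.80, 0.75, 0.10],
    [0.70, 0.20, 0.65],
]
print(top1(sim))                # [0, 0, 0]: every query retrieves the hub
print(top1(dual_softmax(sim)))  # [0, 1, 2]: the hub is kept only by query 0
```

In this toy case the column-wise prior suppresses the hub motion for queries that are not its strongest match, which is the kind of correction a similarity-matrix post-processing step is meant to provide. Note that the prior needs the scores of a whole batch of queries at once.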


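Source journal

Displays · Engineering & Technology: Engineering, Electrical & Electronic

CiteScore: 4.60

Self-citation rate: 25.60%

Publication count: 138

Review time: 92 days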

About the journal: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technologists and human factors engineers new to the field, will also occasionally be featured.