Video Inference for Human Mesh Recovery with Vision Transformer

Hanbyel Cho, Jaesung Ahn, Yooshin Cho, Junmo Kim
2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)
DOI: 10.1109/FG57933.2023.10042731 · Published: 2023-01-05 · Citations: 2

Abstract

Human Mesh Recovery (HMR) from an image is a challenging problem because of the inherent ambiguity of the task. Existing HMR methods exploit either temporal information or kinematic relationships to achieve higher accuracy, but no method uses both. Hence, we propose "Video Inference for Human Mesh Recovery with Vision Transformer (HMR-ViT)," which takes both temporal and kinematic information into account. In HMR-ViT, a Temporal-kinematic Feature Image is constructed from the feature vectors that an image encoder extracts from video frames. When generating the feature image, we apply a Channel Rearranging Matrix (CRM) so that similar kinematic features are located spatially close together. The feature image is then further encoded with a Vision Transformer, and the SMPL pose and shape parameters are finally inferred by a regression network. Extensive evaluation on the 3DPW and Human3.6M datasets shows that our method achieves competitive performance in HMR.
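The core data-layout idea in the abstract can be sketched in a few lines: per-frame feature vectors are stacked into a 2-D "feature image" (time on one axis, channels on the other), and a permutation-style Channel Rearranging Matrix reorders the channel axis before the image is handed to a Vision Transformer. The sketch below is illustrative only, assuming an arbitrary random permutation; the paper's actual CRM, feature dimensions, and encoder are not reproduced here.

```python
import numpy as np

T, C = 16, 24  # number of video frames, per-frame feature channels (illustrative sizes)
rng = np.random.default_rng(0)

# Stand-in for the per-frame feature vectors an image encoder would produce.
frame_feats = rng.standard_normal((T, C))

# Channel Rearranging Matrix: a C x C permutation matrix that groups
# kinematically similar channels next to each other. Here the permutation
# is random; in HMR-ViT it is chosen so related joints end up adjacent.
perm = rng.permutation(C)
crm = np.eye(C)[perm]  # crm[i, j] == 1 iff j == perm[i]

# Temporal-kinematic Feature Image: frames along rows, rearranged
# channels along columns. This 2-D array plays the role of the "image"
# that is subsequently encoded by the Vision Transformer.
feature_image = frame_feats @ crm.T  # same as frame_feats[:, perm]

assert feature_image.shape == (T, C)
assert np.allclose(feature_image, frame_feats[:, perm])
```

Multiplying by a permutation matrix and fancy-indexing the columns are equivalent; the matrix form matters when the rearrangement is meant to be a differentiable (or learnable) part of the pipeline.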