Junyi Tang, Simin An, Yuanwei Liu, Yong Su, Jin Chen
M2AST: MLP-mixer-based adaptive spatial-temporal graph learning for human motion prediction
DOI: 10.1007/s00530-024-01351-7 (https://doi.org/10.1007/s00530-024-01351-7)
Published: 2024-07-12
Abstract
Human motion prediction is a challenging task in human-centric computer vision: it forecasts future poses from historical sequences. Despite recent progress in modeling the spatial-temporal relationships of motion sequences with complex structured graphs, few approaches provide an adaptive and lightweight representation for the varying graph structures of human motion. Inspired by MLP-Mixer, a lightweight architecture designed for learning complex interactions in multi-dimensional data, we explore its potential as a backbone for motion prediction. To this end, we propose a novel MLP-Mixer-based adaptive spatial-temporal pattern learning framework (M\(^2\)AST). The framework comprises an adaptive spatial mixer that models the spatial relationships between joints, an adaptive temporal mixer that learns temporal smoothness, and a local dynamic mixer that captures fine-grained cross-dependencies between joints of adjacent poses. Together, these components yield a compact representation of human motion dynamics by adaptively considering spatial-temporal dependencies from coarse to fine. Unlike the trivial spatial-temporal MLP-Mixer, the proposed approach captures local and global spatial-temporal relationships simultaneously and more effectively. We extensively evaluated the framework on three commonly used benchmarks (Human3.6M, AMASS, 3DPW MoCap), where it performs comparably to or better than existing state-of-the-art methods in both short- and long-term prediction while using significantly fewer parameters. Overall, the proposed framework provides a novel and efficient solution for human motion prediction with adaptive graph learning.
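For orientation, the following is a minimal sketch of the baseline idea the abstract contrasts against: a plain spatial-temporal MLP-Mixer block applied to a pose sequence. It is not the M\(^2\)AST architecture. The adaptive spatial and temporal mixers and the local dynamic mixer are not modeled here, layer normalization is omitted for brevity, and every name, shape, and hidden size (`mixer_block`, `init_mlp`, 4 frames, 5 joints, hidden width 8) is a hypothetical choice made only for illustration.

```python
import math
import random


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def gelu(v):
    """tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def init_mlp(rng, dim, hidden, scale=0.1):
    """Two-layer MLP parameters mapping dim -> hidden -> dim (illustrative init)."""
    def mat(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    return {"w1": mat(dim, hidden), "b1": [0.0] * hidden,
            "w2": mat(hidden, dim), "b2": [0.0] * dim}


def linear(rows, w, b):
    """Apply y = x @ w + b to every row vector x."""
    return [[sum(x * w[i][j] for i, x in enumerate(row)) + b[j]
             for j in range(len(b))] for row in rows]


def mlp(rows, p):
    """Linear -> GELU -> Linear, applied row-wise."""
    hidden = [[gelu(v) for v in row] for row in linear(rows, p["w1"], p["b1"])]
    return linear(hidden, p["w2"], p["b2"])


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixer_block(x, spatial, temporal):
    """x has T rows (frames), each holding D = joints * 3 coordinates."""
    x = add(x, mlp(x, spatial))        # spatial mixing: across joints within a frame
    xt = transpose(x)                  # D rows, each holding T time steps
    xt = add(xt, mlp(xt, temporal))    # temporal mixing: across frames per channel
    return transpose(xt)               # back to T x D


def count_params(p):
    """Number of scalar weights and biases in one MLP."""
    return (sum(len(r) for r in p["w1"]) + len(p["b1"])
            + sum(len(r) for r in p["w2"]) + len(p["b2"]))


rng = random.Random(0)
T, J = 4, 5                 # hypothetical: 4 frames, 5 joints
D = J * 3                   # x, y, z per joint
spatial = init_mlp(rng, D, hidden=8)
temporal = init_mlp(rng, T, hidden=8)
poses = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(T)]

out = mixer_block(poses, spatial, temporal)
total_params = count_params(spatial) + count_params(temporal)
```

Because the two MLPs are shared across frames and across channels respectively, the parameter count depends only on `T`, `D`, and the hidden width, which is one reason Mixer-style backbones stay small. The block also preserves the `T x D` shape, so several such blocks could be stacked before a prediction head.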