Zhe Xu, Yuan Li, Yuhong Li, Songlin Du, T. Ikenaga
{"title":"基于位移的分层时空神经网络单眼头姿预测","authors":"Zhe Xu, Yuan Li, Yuhong Li, Songlin Du, T. Ikenaga","doi":"10.23919/MVA57639.2023.10216167","DOIUrl":null,"url":null,"abstract":"Head pose prediction aims to forecast future head pose given observed sequence, which plays an increasingly important role in human computer interaction, virtual reality, and driver monitoring. However, since there are many moving possibilities, current head pose works, mainly focusing on estimation, fail to provide sufficient temporal information to meet the high demands for accurate predictions. This paper proposes (A) a Spatio-Temporal Encoder (STE), (B) a displacement based offset generating module, and (C) a time step feature aggregation module. The STE extracts spatial information via Transformer and temporal information according to the time order of frames. The displacement based offset generating module utilizes displacement information through a frequency domain process between adjacent frames to generate an offset to refine the prediction result. Furthermore, the time step feature aggregation module integrates time step features based on the information density and hierarchically extracts past motion information as prior knowledge to capture the motion recurrence. Extensive experiments have shown that the proposed network outperforms related methods, achieving a Mean Absolute Error (MAE) of 4.5865° on simple background sequences and 7.1325° on complex background sequences.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hierarchical Spatio-Temporal Neural Network with Displacement Based Refinement for Monocular Head Pose Prediction\",\"authors\":\"Zhe Xu, Yuan Li, Yuhong Li, Songlin Du, T. Ikenaga\",\"doi\":\"10.23919/MVA57639.2023.10216167\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Head pose prediction aims to forecast future head pose given observed sequence, which plays an increasingly important role in human computer interaction, virtual reality, and driver monitoring. However, since there are many moving possibilities, current head pose works, mainly focusing on estimation, fail to provide sufficient temporal information to meet the high demands for accurate predictions. This paper proposes (A) a Spatio-Temporal Encoder (STE), (B) a displacement based offset generating module, and (C) a time step feature aggregation module. The STE extracts spatial information via Transformer and temporal information according to the time order of frames. The displacement based offset generating module utilizes displacement information through a frequency domain process between adjacent frames to generate an offset to refine the prediction result. Furthermore, the time step feature aggregation module integrates time step features based on the information density and hierarchically extracts past motion information as prior knowledge to capture the motion recurrence. Extensive experiments have shown that the proposed network outperforms related methods, achieving a Mean Absolute Error (MAE) of 4.5865° on simple background sequences and 7.1325° on complex background sequences.\",\"PeriodicalId\":338734,\"journal\":{\"name\":\"2023 18th International Conference on Machine Vision and Applications (MVA)\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 18th International Conference on Machine Vision and Applications (MVA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/MVA57639.2023.10216167\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 18th International Conference on Machine Vision and Applications (MVA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/MVA57639.2023.10216167","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hierarchical Spatio-Temporal Neural Network with Displacement Based Refinement for Monocular Head Pose Prediction
Head pose prediction aims to forecast future head pose given observed sequence, which plays an increasingly important role in human computer interaction, virtual reality, and driver monitoring. However, since there are many moving possibilities, current head pose works, mainly focusing on estimation, fail to provide sufficient temporal information to meet the high demands for accurate predictions. This paper proposes (A) a Spatio-Temporal Encoder (STE), (B) a displacement based offset generating module, and (C) a time step feature aggregation module. The STE extracts spatial information via Transformer and temporal information according to the time order of frames. The displacement based offset generating module utilizes displacement information through a frequency domain process between adjacent frames to generate an offset to refine the prediction result. Furthermore, the time step feature aggregation module integrates time step features based on the information density and hierarchically extracts past motion information as prior knowledge to capture the motion recurrence. Extensive experiments have shown that the proposed network outperforms related methods, achieving a Mean Absolute Error (MAE) of 4.5865° on simple background sequences and 7.1325° on complex background sequences.