HandDiff: Spatial–temporal diffusion model for hand pose forecasting
Authors: Jinguang Tong, Kaihao Zhang
Journal: Information Fusion, volume 127, Article 103647
Publication date: 2025-09-09
DOI: 10.1016/j.inffus.2025.103647
URL: https://www.sciencedirect.com/science/article/pii/S1566253525007195
We introduce the task of forecasting future 3D hand poses from a short past sequence. The primary challenge is accurately modeling the stochastic nature of future hand movements. To address this, we propose HandDiff, a diffusion-based hand pose forecasting model that generates accurate future hand poses by leveraging spatial–temporal information. The model incorporates a Spatial–Temporal Attention Module (STAM) to capture correlations between hand joints and time points, and a Coarse Forecasting Module (CFM) to extract limited explicit guidance from the temporal dimension. The features from these two modules condition the diffusion model to forecast plausible future hand poses. Because no suitable datasets exist, we also construct two large-scale datasets for benchmarking hand pose forecasting, built on the existing hand-object interaction (HOI) datasets HO-3D and HOI4D and covering both third-person and egocentric perspectives. Experimental results show that HandDiff outperforms other state-of-the-art (SOTA) methods in mean per joint position error (MPJPE) by 16.7% on the HO-3D dataset and by 11.1% on the HOI4D dataset.
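For readers unfamiliar with the metric, MPJPE is the Euclidean distance between each predicted joint and its ground-truth position, averaged over all joints and frames. The sketch below is a minimal pure-Python illustration and is not taken from the paper: the function names `mpjpe` and `relative_reduction` are hypothetical, and reading the reported percentages as a relative reduction in MPJPE against a baseline is an assumption.

```python
import math
from typing import Sequence

Joint = Sequence[float]        # one 3D joint position (x, y, z)
Pose = Sequence[Joint]         # all joints of a hand in one frame
PoseSequence = Sequence[Pose]  # one pose per forecast frame


def mpjpe(pred: PoseSequence, target: PoseSequence) -> float:
    """Mean per joint position error over all frames and joints."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of joints")
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage by which new_error is lower than baseline_error.

    Assumption: this is one common way to report an improvement;
    the abstract does not state how its percentages were computed.
    """
    return 100.0 * (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Toy data: one frame, two joints; the first joint is off by 5 units.
    target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
    pred = [[(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]]
    print(mpjpe(pred, target))
    print(relative_reduction(12.0, 10.0))
```

The error is expressed in the same units as the joint coordinates, so comparisons are only meaningful between methods evaluated on the same dataset and coordinate convention.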
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.