SVFFNet: A Scale-Aware Voxel Flow Fusion Network for video prediction
Yao Zhou, Jinpeng Wei, Xueyong Zhang, Yusong Zhai, Jian Wei
Computer Vision and Image Understanding, Volume 261, Article 104520
DOI: 10.1016/j.cviu.2025.104520 · Published: 2025-10-06
Video prediction is a challenging task because complex scenes can contain motion at many different scales. This diversity of motion scales stems from time-varying, object-dependent motion magnitudes, as well as the differing image resolutions across datasets. However, the vast majority of frame-forecasting networks do not treat different motion scales separately, so their receptive fields are usually too small to capture larger-scale motions. Networks that do handle multiple scales often introduce significant local distortions in the predicted frames, because they rely on fixed scale factors and lack cross-scale interaction between motion features. In this work, we propose a Scale-Aware Voxel Flow Fusion Network (SVFFNet) to address the motion-scale inconsistency problem and fully integrate multi-scale features. The network consists of a set of flow-estimation blocks, each containing a selector module and a fusion module. The selector module adaptively chooses the appropriate scale-processing branch for the input frames, yielding more refined features for large-scale motion. The fusion module then combines these features with the original motion information via an attention mechanism, preserving genuine structural details. Experimental results on four widely used benchmark datasets demonstrate that our method outperforms previously published video-prediction baselines. The code is available at: https://github.com/zyaojlu/SVFFNet.
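The selector-then-fuse idea described in the abstract can be sketched in a few lines of numpy. This is only an illustrative sketch of the control flow, not the paper's implementation: the thresholds, the mean-magnitude statistic, and the element-wise softmax fusion are assumptions standing in for SVFFNet's learned selector and attention modules.

```python
import numpy as np

def select_scale_branch(flow_mag, thresholds=(1.0, 4.0)):
    """Choose a downsampling factor from the mean motion magnitude.
    Larger motion -> coarser branch, enlarging the effective receptive
    field. Thresholds are illustrative, not from the paper."""
    m = float(np.mean(flow_mag))
    if m < thresholds[0]:
        return 1   # small motion: full-resolution branch
    elif m < thresholds[1]:
        return 2   # medium motion: half-resolution branch
    return 4       # large motion: quarter-resolution branch

def attention_fuse(refined, original):
    """Blend refined multi-scale features with the original motion
    features via a per-pixel softmax weight over the two sources
    (a stand-in for the paper's learned attention mechanism)."""
    logits = np.stack([refined, original])            # (2, H, W)
    w = np.exp(logits) / np.exp(logits).sum(axis=0)   # softmax over sources
    return (w * logits).sum(axis=0)                   # weighted blend, (H, W)
```

In this toy version, a frame pair with large average flow magnitude would be routed to the quarter-resolution branch, and the fusion step keeps the output bounded between the two feature maps so neither source dominates outright.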
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems