Motion-guided token prioritization and semantic degradation fusion for exo-to-ego cross-view video generation

Weipeng Hu, Jiun Tian Hoe, Runzhong Zhang, Yiming Yang, Haifeng Hu, Yap-Peng Tan

Information Fusion, Volume 123, Article 103273 (2025). DOI: 10.1016/j.inffus.2025.103273
Abstract
Exocentric (third-person) to egocentric (first-person) cross-view video generation aims to synthesize the egocentric view of a video from an exocentric view. However, current techniques either use a sub-optimal image-based approach that ignores temporal information, or require target-view cues that limit application flexibility. In this paper, we tackle the challenging cue-free Exocentric-to-Egocentric Video Generation (E2VG) problem via a video-based method, called motion-guided Token Prioritization and semantic Degradation Fusion (TPDF). Considering that motion cues can provide useful overlapping trails between the two views by tracking the movement of humans and objects of interest, the proposed motion-guided token prioritization incorporates motion cues to adaptively distinguish between informative and uninformative tokens. Specifically, our design of the Motion-guided Spatial token Prioritization Transformer (MSPT) and the Motion-guided Temporal token Prioritization Transformer (MTPT) incorporates motion cues to adaptively identify patches/tokens as informative or uninformative under orthogonal constraints, ensuring accurate attention retrieval and spatial–temporal consistency in cross-view generation. Additionally, we present a Semantic Degradation Fusion (SDF) module that progressively learns egocentric semantics through a degradation learning mechanism, enabling our model to infer egocentric-view content. By extending these components in a cascaded fashion, the Cascaded token Prioritization and Degradation fusion (CPD) enhances attention learning with informative tokens and fuses egocentric semantics at different levels of granularity. Extensive experiments demonstrate that our method is quantitatively and qualitatively superior to state-of-the-art approaches.
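As a rough, hedged illustration (not the authors' implementation), the sketch below shows one way motion-guided token prioritization could look in code: per-patch motion magnitude estimated from frame differences ranks exocentric tokens, and cross-attention from egocentric queries attends only to the top-scoring, motion-salient tokens. All function names, tensor shapes, and the keep ratio are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the paper's code) of motion-guided
# token prioritization: motion-salient patches are treated as informative
# and only those tokens participate in cross-view attention.
import torch
import torch.nn.functional as F


def patch_motion_scores(frames, patch=16):
    """frames: (T, C, H, W) -> per-patch motion magnitude of shape (T-1, N)."""
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
    # Average absolute frame difference inside each patch as a motion proxy.
    scores = F.avg_pool2d(diff, kernel_size=patch, stride=patch)       # (T-1, 1, H/p, W/p)
    return scores.flatten(1)                                           # (T-1, N)


def motion_guided_cross_attention(ego_q, exo_kv, motion, keep_ratio=0.5):
    """ego_q: (Nq, D) egocentric queries; exo_kv: (Nk, D) exocentric tokens;
    motion: (Nk,) motion score per exocentric token."""
    k = max(1, int(keep_ratio * exo_kv.size(0)))
    idx = motion.topk(k).indices                      # indices of "informative" tokens
    kv = exo_kv[idx]                                  # drop motion-poor tokens
    attn = torch.softmax(ego_q @ kv.t() / kv.size(-1) ** 0.5, dim=-1)
    return attn @ kv                                  # (Nq, D) aggregated features


if __name__ == "__main__":
    T, C, H, W, D = 4, 3, 224, 224, 64
    frames = torch.rand(T, C, H, W)
    motion = patch_motion_scores(frames).mean(dim=0)  # (196,) score per patch
    exo_tokens = torch.randn(motion.numel(), D)
    ego_queries = torch.randn(196, D)
    out = motion_guided_cross_attention(ego_queries, exo_tokens, motion)
    print(out.shape)  # torch.Size([196, 64])
```

In the paper's full design, the informative/uninformative split is learned with orthogonal constraints and applied both spatially (MSPT) and temporally (MTPT); the fixed top-k selection above is only a stand-in for that learned mechanism.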
About the journal
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.