Motion-guided token prioritization and semantic degradation fusion for exo-to-ego cross-view video generation

Weipeng Hu, Jiun Tian Hoe, Runzhong Zhang, Yiming Yang, Haifeng Hu, Yap-Peng Tan

Information Fusion, Volume 123, Article 103273 (2025). DOI: 10.1016/j.inffus.2025.103273
Abstract
Exocentric (third-person) to egocentric (first-person) cross-view video generation aims to synthesize the egocentric view of a video from an exocentric view. However, current techniques either use a sub-optimal image-based approach that ignores temporal information, or require target-view cues that limit application flexibility. In this paper, we tackle the challenging cue-free Exocentric-to-Egocentric Video Generation (E2VG) problem via a video-based method, called motion-guided Token Prioritization and semantic Degradation Fusion (TPDF). Considering that motion cues can provide useful overlapping trails between the two views by tracking the movement of humans and objects of interest, the proposed motion-guided token prioritization incorporates motion cues to adaptively distinguish between informative and uninformative tokens. Specifically, our design of the Motion-guided Spatial token Prioritization Transformer (MSPT) and the Motion-guided Temporal token Prioritization Transformer (MTPT) incorporates motion cues to adaptively identify patches/tokens as informative or uninformative under orthogonal constraints, ensuring accurate attention retrieval and spatial–temporal consistency in cross-view generation. Additionally, we present a Semantic Degradation Fusion (SDF) module that progressively learns egocentric semantics through a degradation learning mechanism, enabling our model to infer egocentric-view content. By extending these components in a cascaded fashion, the Cascaded token Prioritization and Degradation fusion (CPD) enhances attention learning with informative tokens and fuses egocentric semantics at different levels of granularity. Extensive experiments demonstrate that our method is quantitatively and qualitatively superior to state-of-the-art approaches.
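As a rough, hedged illustration (not the authors' implementation), the sketch below shows one way motion-guided token prioritization could look in code: per-patch motion magnitude estimated from frame differences ranks exocentric tokens, and cross-attention from egocentric queries attends only to the top-scoring, motion-salient tokens. All function names, tensor shapes, and the keep ratio are assumptions made for illustration.

```python
# Minimal sketch (assumed design, not the paper's code) of motion-guided
# token prioritization: motion-salient patches are treated as informative
# and only those tokens participate in cross-view attention.
import torch
import torch.nn.functional as F


def patch_motion_scores(frames, patch=16):
    """frames: (T, C, H, W) -> per-patch motion magnitude of shape (T-1, N)."""
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
    # Average absolute frame difference inside each patch as a motion proxy.
    scores = F.avg_pool2d(diff, kernel_size=patch, stride=patch)       # (T-1, 1, H/p, W/p)
    return scores.flatten(1)                                           # (T-1, N)


def motion_guided_cross_attention(ego_q, exo_kv, motion, keep_ratio=0.5):
    """ego_q: (Nq, D) egocentric queries; exo_kv: (Nk, D) exocentric tokens;
    motion: (Nk,) motion score per exocentric token."""
    k = max(1, int(keep_ratio * exo_kv.size(0)))
    idx = motion.topk(k).indices                      # indices of "informative" tokens
    kv = exo_kv[idx]                                  # drop motion-poor tokens
    attn = torch.softmax(ego_q @ kv.t() / kv.size(-1) ** 0.5, dim=-1)
    return attn @ kv                                  # (Nq, D) aggregated features


if __name__ == "__main__":
    T, C, H, W, D = 4, 3, 224, 224, 64
    frames = torch.rand(T, C, H, W)
    motion = patch_motion_scores(frames).mean(dim=0)  # (196,) score per patch
    exo_tokens = torch.randn(motion.numel(), D)
    ego_queries = torch.randn(196, D)
    out = motion_guided_cross_attention(ego_queries, exo_tokens, motion)
    print(out.shape)  # torch.Size([196, 64])
```

In the paper's full design, the informative/uninformative split is learned with orthogonal constraints and applied both spatially (MSPT) and temporally (MTPT); the fixed top-k selection above is only a stand-in for that learned mechanism.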
About the journal
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.