COTA-motion: Controllable image-to-video synthesis with dense semantic trajectories

Impact Factor 6.5 · CAS Tier 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Yirui Chen, Wenqing Chu, Ye Wu, Jie Yang, Xiaonan Mao, Wei Liu
{"title":"COTA-motion: Controllable image-to-video synthesis with dense semantic trajectories","authors":"Yirui Chen ,&nbsp;Wenqing Chu ,&nbsp;Ye Wu ,&nbsp;Jie Yang ,&nbsp;Xiaonan Mao ,&nbsp;Wei Liu","doi":"10.1016/j.neucom.2025.131671","DOIUrl":null,"url":null,"abstract":"<div><div>Motion transfer, which aims to animate an object in a static image by transferring motion from a reference video, remains a fundamental yet challenging task in content creation. While recent diffusion-based image-to-video models offer fine-grained control over visual appearance, most existing methods rely on ambiguous text prompts or coarse drag-based motion cues, making it difficult to achieve accurate and consistent motion synthesis. To address these limitations, we propose COTA-Motion, a general framework for controllable image-to-video motion transfer. Our method leverages a dense trajectory-based semantic representation extracted from the driving video to provide explicit motion guidance. Specifically, we segment the salient object and extract its point-wise trajectories across frames. These trajectories are enriched with semantic embeddings and reprojected into a spatial-temporal tensor, forming the motion embedding. To utilize this motion representation, we introduce the COTA Adapter, which integrates image content with semantic trajectories via cross-attention, enabling accurate and flexible control over the generated motion. At inference, we further incorporate an alignment module to address discrepancies between the input image and motion cues, ensuring spatial consistency. Built upon a pre-trained video diffusion model, COTA-Motion only requires lightweight fine-tuning on a small set of videos, and it enables high-quality, controllable motion transfer from video to image. Extensive experiments demonstrate the effectiveness of our approach in generating visually coherent and motion-aligned video outputs.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"657 ","pages":"Article 131671"},"PeriodicalIF":6.5000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225023434","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Motion transfer, which aims to animate an object in a static image by transferring motion from a reference video, remains a fundamental yet challenging task in content creation. While recent diffusion-based image-to-video models offer fine-grained control over visual appearance, most existing methods rely on ambiguous text prompts or coarse drag-based motion cues, making it difficult to achieve accurate and consistent motion synthesis. To address these limitations, we propose COTA-Motion, a general framework for controllable image-to-video motion transfer. Our method leverages a dense trajectory-based semantic representation extracted from the driving video to provide explicit motion guidance. Specifically, we segment the salient object and extract its point-wise trajectories across frames. These trajectories are enriched with semantic embeddings and reprojected into a spatial-temporal tensor, forming the motion embedding. To utilize this motion representation, we introduce the COTA Adapter, which integrates image content with semantic trajectories via cross-attention, enabling accurate and flexible control over the generated motion. At inference, we further incorporate an alignment module to address discrepancies between the input image and motion cues, ensuring spatial consistency. Built upon a pre-trained video diffusion model, COTA-Motion only requires lightweight fine-tuning on a small set of videos, and it enables high-quality, controllable motion transfer from video to image. Extensive experiments demonstrate the effectiveness of our approach in generating visually coherent and motion-aligned video outputs.
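To make the described pipeline concrete, the sketch below illustrates the two core ideas from the abstract: reprojecting semantically tagged point trajectories into a spatio-temporal motion embedding, and a cross-attention adapter block that injects that embedding into video latent features. This is a minimal illustration in PyTorch, not the authors' implementation; all names, tensor shapes, and the scatter/attention details are assumptions for exposition.

```python
# Illustrative sketch only: build_motion_embedding and CotaAdapterBlock are
# hypothetical names, not the paper's released code.
import torch
import torch.nn as nn


def build_motion_embedding(trajectories, point_features, T, H, W):
    """Reproject point-wise trajectories into a spatio-temporal tensor.

    trajectories:   (N, T, 2) pixel coordinates (x, y) of N tracked points per frame.
    point_features: (N, C) semantic embedding attached to each tracked point.
    Returns:        (T, C, H, W) motion embedding; unvisited cells remain zero.
    """
    C = point_features.shape[1]
    motion = torch.zeros(T, C, H, W)
    for t in range(T):
        xy = trajectories[:, t].round().long()
        x = xy[:, 0].clamp(0, W - 1)
        y = xy[:, 1].clamp(0, H - 1)
        # Write each point's semantic feature at its tracked location in frame t.
        motion[t, :, y, x] = point_features.t()
    return motion


class CotaAdapterBlock(nn.Module):
    """One cross-attention step: latent tokens (queries) attend to motion tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latent_tokens, motion_tokens):
        # latent_tokens: (B, L, dim) flattened video-latent features.
        # motion_tokens: (B, M, dim) flattened motion-embedding features.
        q = self.norm(latent_tokens)
        out, _ = self.attn(q, motion_tokens, motion_tokens)
        return latent_tokens + out  # residual injection of motion guidance
```

In an adapter-style design such as the one described, a block like this would be applied at selected layers of the frozen video diffusion backbone during each denoising step, so that only the lightweight adapter parameters need fine-tuning.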
Source Journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published per year: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.