Yirui Chen, Wenqing Chu, Ye Wu, Jie Yang, Xiaonan Mao, Wei Liu
{"title":"COTA-motion:具有密集语义轨迹的可控图像到视频合成","authors":"Yirui Chen , Wenqing Chu , Ye Wu , Jie Yang , Xiaonan Mao , Wei Liu","doi":"10.1016/j.neucom.2025.131671","DOIUrl":null,"url":null,"abstract":"<div><div>Motion transfer, which aims to animate an object in a static image by transferring motion from a reference video, remains a fundamental yet challenging task in content creation. While recent diffusion-based image-to-video models offer fine-grained control over visual appearance, most existing methods rely on ambiguous text prompts or coarse drag-based motion cues, making it difficult to achieve accurate and consistent motion synthesis. To address these limitations, we propose COTA-Motion, a general framework for controllable image-to-video motion transfer. Our method leverages a dense trajectory-based semantic representation extracted from the driving video to provide explicit motion guidance. Specifically, we segment the salient object and extract its point-wise trajectories across frames. These trajectories are enriched with semantic embeddings and reprojected into a spatial-temporal tensor, forming the motion embedding. To utilize this motion representation, we introduce the COTA Adapter, which integrates image content with semantic trajectories via cross-attention, enabling accurate and flexible control over the generated motion. At inference, we further incorporate an alignment module to address discrepancies between the input image and motion cues, ensuring spatial consistency. Built upon a pre-trained video diffusion model, COTA-Motion only requires lightweight fine-tuning on a small set of videos, and it enables high-quality, controllable motion transfer from video to image. Extensive experiments demonstrate the effectiveness of our approach in generating visually coherent and motion-aligned video outputs.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"657 ","pages":"Article 131671"},"PeriodicalIF":6.5000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"COTA-motion: Controllable image-to-video synthesis with dense semantic trajectories\",\"authors\":\"Yirui Chen , Wenqing Chu , Ye Wu , Jie Yang , Xiaonan Mao , Wei Liu\",\"doi\":\"10.1016/j.neucom.2025.131671\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Motion transfer, which aims to animate an object in a static image by transferring motion from a reference video, remains a fundamental yet challenging task in content creation. While recent diffusion-based image-to-video models offer fine-grained control over visual appearance, most existing methods rely on ambiguous text prompts or coarse drag-based motion cues, making it difficult to achieve accurate and consistent motion synthesis. To address these limitations, we propose COTA-Motion, a general framework for controllable image-to-video motion transfer. Our method leverages a dense trajectory-based semantic representation extracted from the driving video to provide explicit motion guidance. Specifically, we segment the salient object and extract its point-wise trajectories across frames. These trajectories are enriched with semantic embeddings and reprojected into a spatial-temporal tensor, forming the motion embedding. 
To utilize this motion representation, we introduce the COTA Adapter, which integrates image content with semantic trajectories via cross-attention, enabling accurate and flexible control over the generated motion. At inference, we further incorporate an alignment module to address discrepancies between the input image and motion cues, ensuring spatial consistency. Built upon a pre-trained video diffusion model, COTA-Motion only requires lightweight fine-tuning on a small set of videos, and it enables high-quality, controllable motion transfer from video to image. Extensive experiments demonstrate the effectiveness of our approach in generating visually coherent and motion-aligned video outputs.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"657 \",\"pages\":\"Article 131671\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-09-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231225023434\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225023434","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
COTA-motion: Controllable image-to-video synthesis with dense semantic trajectories
Motion transfer, which aims to animate an object in a static image by transferring motion from a reference video, remains a fundamental yet challenging task in content creation. While recent diffusion-based image-to-video models offer fine-grained control over visual appearance, most existing methods rely on ambiguous text prompts or coarse drag-based motion cues, making it difficult to achieve accurate and consistent motion synthesis. To address these limitations, we propose COTA-Motion, a general framework for controllable image-to-video motion transfer. Our method leverages a dense trajectory-based semantic representation extracted from the driving video to provide explicit motion guidance. Specifically, we segment the salient object and extract its point-wise trajectories across frames. These trajectories are enriched with semantic embeddings and reprojected into a spatio-temporal tensor, forming the motion embedding. To utilize this motion representation, we introduce the COTA Adapter, which integrates image content with semantic trajectories via cross-attention, enabling accurate and flexible control over the generated motion. At inference time, we further incorporate an alignment module to address discrepancies between the input image and the motion cues, ensuring spatial consistency. Built upon a pre-trained video diffusion model, COTA-Motion requires only lightweight fine-tuning on a small set of videos, and it enables high-quality, controllable motion transfer from video to image. Extensive experiments demonstrate the effectiveness of our approach in generating visually coherent and motion-aligned video outputs.
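To make the described pipeline more concrete, the sketch below illustrates two of the steps from the abstract: rasterizing point-wise trajectories with per-point semantic features into a spatio-temporal motion-embedding tensor, and injecting that embedding into image tokens via cross-attention. This is a minimal illustrative sketch, not the authors' implementation; all function and class names, tensor shapes, and architectural choices (e.g. a single residual multi-head cross-attention layer) are assumptions made for clarity.

import torch
import torch.nn as nn


def trajectories_to_motion_embedding(points, features, grid_hw):
    """Scatter per-point semantic features onto a (T, C, H, W) grid.

    points:   (T, N, 2) trajectory coordinates in [0, 1] as (x, y) per frame.
    features: (N, C) semantic embedding of each tracked point.
    (Shapes and normalization are illustrative assumptions.)
    """
    T, N, _ = points.shape
    C = features.shape[1]
    H, W = grid_hw
    motion = torch.zeros(T, C, H, W)
    xs = (points[..., 0].clamp(0, 1) * (W - 1)).long()  # (T, N) column indices
    ys = (points[..., 1].clamp(0, 1) * (H - 1)).long()  # (T, N) row indices
    for t in range(T):
        # Write each point's C-dim semantic feature at its location in frame t.
        motion[t, :, ys[t], xs[t]] = features.T
    return motion  # spatio-temporal motion embedding


class MotionCrossAttentionAdapter(nn.Module):
    """Fuse image tokens with motion-embedding tokens via cross-attention.

    A stand-in for the role the COTA Adapter plays in the abstract; the real
    module's internals are not specified here.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, image_tokens, motion_tokens):
        # image_tokens:  (B, L_img, dim) queries from the image/video backbone
        # motion_tokens: (B, L_mot, dim) keys/values from the motion embedding
        q = self.norm_q(image_tokens)
        kv = self.norm_kv(motion_tokens)
        out, _ = self.attn(q, kv, kv)
        return image_tokens + out  # residual injection of motion guidance


# Toy usage: 16 frames, 300 tracked points, 320-dim semantic features.
points = torch.rand(16, 300, 2)
feats = torch.randn(300, 320)
motion = trajectories_to_motion_embedding(points, feats, grid_hw=(32, 32))
adapter = MotionCrossAttentionAdapter(dim=320)
motion_tokens = motion.flatten(2).permute(0, 2, 1)          # (16, 1024, 320)
image_tokens = torch.randn(16, 1024, 320)                   # placeholder backbone tokens
fused = adapter(image_tokens, motion_tokens)                # (16, 1024, 320)

In this sketch the motion embedding is flattened into tokens and attended to by image tokens, with a residual connection so the pre-trained backbone's features are perturbed rather than replaced; this mirrors the lightweight-fine-tuning setting described in the abstract, but the actual token layout and conditioning path in COTA-Motion may differ.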
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.