PADVG: A Simple Baseline of Active Protection for Audio-driven Video Generation

IF 5.2 · CAS Zone 3 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

ACM Transactions on Multimedia Computing Communications and Applications Pub Date : 2024-01-16 DOI:10.1145/3638556

Huan Liu, Xiaolong Liu, Zichang Tan, Xiaolong Li, Yao Zhao


Citations: 0

Abstract

Over the past few years, deep generative models have evolved significantly, enabling the synthesis of realistic content but also raising security concerns about illegal misuse. Active protection for generative models has therefore been proposed recently; it aims to generate samples that carry hidden messages for future identification while preserving the original generation performance. However, existing active protection methods are designed specifically for generative adversarial networks (GANs) and are restricted to unconditional image generation. We observe that they achieve limited identification performance and visual quality on audio-driven video generation, which is conditioned on target audio and a source input and must keep context, e.g., identity and movement, consistent across the frame sequence. To address this issue, we introduce a simple yet effective active Protection framework for Audio-Driven Video Generation, named PADVG. Specifically, we present a novel frame-shared embedding module in which the messages to hide are first transformed into frame-shared message coefficients. These coefficients are then assembled with the intermediate feature maps of the video generator at multiple feature levels to generate the embedded video frames. PADVG further considers two visual consistency losses: i) an intra-frame loss keeps the output visually consistent under different hidden messages; ii) an inter-frame loss preserves visual consistency across different video frames. We also propose an auxiliary denoising training strategy that perturbs the assembled features with learnable pixel-level noise to improve identification performance while enhancing robustness against real-world disturbances. Extensive experiments demonstrate that the proposed PADVG can effectively identify the generated videos and achieve high visual quality.
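The frame-shared embedding and the two consistency losses described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not give the formulas, so the linear message map, the multiplicative modulation, the `strength` parameter, the mean-squared-error form of both losses, and all function names below are assumptions chosen for illustration. Only a single feature level is shown, and the denoising strategy is omitted.

```python
import numpy as np

def message_to_coefficients(message_bits, weight, bias):
    """Map a binary message to per-channel coefficients.

    Assumption: a plain linear map; the paper's actual transform is not
    specified in the abstract.
    """
    signed = 2.0 * message_bits - 1.0   # {0, 1} -> {-1, +1}
    return signed @ weight + bias       # shape: (channels,)

def embed(features, coeffs, strength=0.1):
    """Modulate every frame's feature map with the SAME coefficients.

    features: (frames, channels, height, width)
    coeffs:   (channels,), shared across all frames ("frame-shared")
    Assumption: multiplicative modulation with a hypothetical strength factor.
    """
    return features * (1.0 + strength * coeffs[None, :, None, None])

def intra_frame_loss(frames_a, frames_b):
    """Mean squared difference between outputs carrying two different messages."""
    return float(np.mean((frames_a - frames_b) ** 2))

def inter_frame_loss(embedded, reference):
    """Mean squared difference between the frame-to-frame changes of the
    embedded output and those of the unprotected reference."""
    d_emb = np.diff(embedded, axis=0)
    d_ref = np.diff(reference, axis=0)
    return float(np.mean((d_emb - d_ref) ** 2))

# Toy data: 4 frames, 8 channels, 5x5 feature maps, 16-bit messages.
rng = np.random.default_rng(0)
n_bits, channels = 16, 8
weight = rng.normal(scale=0.1, size=(n_bits, channels))
bias = np.zeros(channels)
features = rng.normal(size=(4, channels, 5, 5))

msg_a = rng.integers(0, 2, size=n_bits)
msg_b = 1 - msg_a                       # a second, different message

out_a = embed(features, message_to_coefficients(msg_a, weight, bias))
out_b = embed(features, message_to_coefficients(msg_b, weight, bias))

loss_intra = intra_frame_loss(out_a, out_b)      # same input, different messages
loss_inter = inter_frame_loss(out_a, features)   # temporal change vs. reference
```

Because the coefficients have no frame dimension, every frame is modulated identically, which is one simple way to read "frame-shared". In a real system the two losses would be minimized jointly with a message-recovery objective during training.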




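Source journal

ACM Transactions on Multimedia Computing Communications and Applications (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 8.50

Self-citation rate: 5.90%

Articles published: 285

Review time: 7.5 months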

About the journal: The ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. The journal consists primarily of research papers. As an archival journal, it is intended that its papers have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.