Latest Articles in ACM Transactions on Multimedia Computing Communications and Applications

Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation
IF 5.1, Quartile 3 (Computer Science)
Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv
{"title":"Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation","authors":"Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv","doi":"10.1145/3653714","DOIUrl":"https://doi.org/10.1145/3653714","url":null,"abstract":"<p>Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object based on the pre-learned object category relations and its trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects in temporal, spatial, and regions within observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical moving or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the current observation objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code and will be publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Recoverable Privacy-Preserving Image Classification through Noise-like Adversarial Examples
IF 5.1, Quartile 3 (Computer Science)
Jun Liu, Jiantao Zhou, Jinyu Tian, Weiwei Sun
{"title":"Recoverable Privacy-Preserving Image Classification through Noise-like Adversarial Examples","authors":"Jun Liu, Jiantao Zhou, Jinyu Tian, Weiwei Sun","doi":"10.1145/3653676","DOIUrl":"https://doi.org/10.1145/3653676","url":null,"abstract":"<p>With the increasing prevalence of cloud computing platforms, ensuring data privacy during the cloud-based image-related services such as classification has become crucial. In this study, we propose a novel privacy-preserving image classification scheme that enables the direct application of classifiers trained in the plaintext domain to classify encrypted images, without the need of retraining a dedicated classifier. Moreover, encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key. Specifically, our proposed scheme involves utilizing a feature extractor and an encoder to mask the plaintext image through a newly designed Noise-like Adversarial Example (NAE). Such an NAE not only introduces a noise-like visual appearance to the encrypted image but also compels the target classifier to predict the ciphertext as the same label as the original plaintext image. At the decoding phase, we adopt a Symmetric Residual Learning (SRL) framework for restoring the plaintext image with minimal degradation. Extensive experiments demonstrate that 1) the classification accuracy of the classifier trained in the plaintext domain remains the same in both the ciphertext and plaintext domains; 2) the encrypted images can be recovered into their original form with an average PSNR of up to 51+ dB for the SVHN dataset and 48+ dB for the VGGFace2 dataset; 3) our system exhibits satisfactory generalization capability on the encryption, decryption and classification tasks across datasets that are different from the training one; and 4) a high-level of security is achieved against three potential threat models. The code is available at https://github.com/csjunjun/RIC.git.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Feature Extraction Matters More: An Effective and Efficient Universal Deepfake Disruptor
IF 5.1, Quartile 3 (Computer Science)
Long Tang, Dengpan Ye, Zhenhao Lu, Yunming Zhang, Chuanxi Chen
{"title":"Feature Extraction Matters More: An Effective and Efficient Universal Deepfake Disruptor","authors":"Long Tang, Dengpan Ye, Zhenhao Lu, Yunming Zhang, Chuanxi Chen","doi":"10.1145/3653457","DOIUrl":"https://doi.org/10.1145/3653457","url":null,"abstract":"<p>Face manipulation can modify a victim’s facial attributes, e.g., age or hair color, in an image, which is an important component of DeepFakes. Adversarial examples are an emerging approach to combat the threat of visual misinformation to society. To efficiently protect facial images from being forged, designing a universal face anti-manipulation disruptor is essential. However, existing works treat deepfake disruption as an end-to-end process, ignoring the functional difference between feature extraction and image reconstruction. In this work, we propose a novel <b>F</b>eature-<b>O</b>utput ensemble <b>UN</b>iversal <b>D</b>isruptor (FOUND) against face manipulation networks, which explores a new opinion considering attacking feature-extraction (encoding) modules as the critical task in deepfake disruption. We conduct an effective two-stage disruption process. We first perform ensemble disruption on multi-model encoders, maximizing the Wasserstein distance between features before and after the adversarial attack. Then develop a gradient-ensemble strategy to enhance the disruption effect by simplifying the complex optimization problem of disrupting ensemble end-to-end models. Extensive experiments indicate that one FOUND generated with a few facial images can successfully disrupt multiple face manipulation models on cross-attribute and cross-face images, surpassing state-of-the-art universal disruptors in both success rate and efficiency.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140166015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
New Metrics and Dataset for Biological Development Video Generation
IF 5.1, Quartile 3 (Computer Science)
P. Celard, E. L. Iglesias, J. M. Sorribes-Fdez, L. Borrajo, A. Seara Vieira
{"title":"New Metrics and Dataset for Biological Development Video Generation","authors":"P. Celard, E. L. Iglesias, J. M. Sorribes-Fdez, L. Borrajo, A. Seara Vieira","doi":"10.1145/3653456","DOIUrl":"https://doi.org/10.1145/3653456","url":null,"abstract":"<p>Image generative models have advanced in many areas to produce synthetic images of high resolution and detail. This success has enabled its use in the biomedical field, paving the way for the generation of videos showing the biological evolution of its content. Despite the power of generative video models, their use has not yet extended to time-based development, focusing almost exclusively on generating motion in space. This situation is largely due to the lack of specific data sets and metrics to measure the individual quality of videos, particularly when there is no ground truth available for comparison. We propose a new dataset, called GoldenDOT, which tracks the evolution of apples cut in parallel over 10 days, allowing to observe their progress over time while remaining static. In addition, four new metrics are proposed that provide different analyses of the generated videos as a whole and individually. In this paper, the proposed dataset and measures are used to study three state of the art video generative models and their feasibility for video generation with biological development: TemporalGAN (TGANv2), Low Dimensional Video Discriminator GAN (LDVDGAN), and Video Diffusion Model (VDM). Among them, the TGANv2 model has managed to obtain the best results in the vast majority of metrics, including those already known in the state of the art, demonstrating the viability of the new proposed metrics and their congruence with these standard measures.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140165844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Generative Adversarial Networks with Learnable Auxiliary Module for Image Synthesis
IF 5.1, Quartile 3 (Computer Science)
Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang
{"title":"Generative Adversarial Networks with Learnable Auxiliary Module for Image Synthesis","authors":"Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang","doi":"10.1145/3653021","DOIUrl":"https://doi.org/10.1145/3653021","url":null,"abstract":"<p>Training generative adversarial networks (GANs) for noise-to-image synthesis is a challenge task, primarily due to the instability of GANs’ training process. One of the key issues is the generator’s sensitivity to input data, which can cause sudden fluctuations in the generator’s loss value with certain inputs. This sensitivity suggests an inadequate ability to resist disturbances in the generator, causing the discriminator’s loss value to oscillate and negatively impacting the discriminator. Then, the negative feedback of discriminator is also not conducive to updating generator’s parameters, leading to suboptimal image generation quality. In response to this challenge, we present an innovative GANs model equipped with a learnable auxiliary module that processes auxiliary noise. The core objective of this module is to enhance the stability of both the generator and discriminator throughout the training process. To achieve this target, we incorporate a learnable auxiliary penalty and an augmented discriminator, designed to control the generator and reinforce the discriminator’s stability, respectively. We further apply our method to the Hinge and LSGANs loss functions, illustrating its efficacy in reducing the instability of both the generator and the discriminator. The tests we conducted on LSUN, CelebA, Market-1501 and Creative Senz3D datasets serve as proof of our method’s ability to improve the training stability and overall performance of the baseline methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Make Partition Fit Task: A Novel Framework for Joint Learning of City Region Partition and Representation
IF 5.1, Quartile 3 (Computer Science)
Mingyu Deng, Wanyi Zhang, Jie Zhao, Zhu Wang, Mingliang Zhou, Jun Luo, Chao Chen
{"title":"Make Partition Fit Task: A Novel Framework for Joint Learning of City Region Partition and Representation","authors":"Mingyu Deng, Wanyi Zhang, Jie Zhao, Zhu Wang, Mingliang Zhou, Jun Luo, Chao Chen","doi":"10.1145/3652857","DOIUrl":"https://doi.org/10.1145/3652857","url":null,"abstract":"<p>The proliferation of multimodal big data in cities provides unprecedented opportunities for modeling and forecasting urban problems, e.g., crime prediction and house price prediction, through data-driven approaches. A fundamental and critical issue in modeling and forecasting urban problems lies in identifying suitable spatial analysis units, also known as city region partition. Existing works rely on subjective domain knowledge for static partitions, which is general and universal for all tasks. In fact, different tasks may need different city region partitions. To address this issue, we propose a task-oriented framework for <underline><b>J</b></underline>oint <underline><b>L</b></underline>earning of region <underline><b>P</b></underline>artition and <underline><b>R</b></underline>epresentation (<b>JLPR</b> for short hereafter). To make partition fit task, <b>JLPR</b> integrates the region partition into the representation model training and learns region partitions using the supervision signal from the downstream task. We evaluate the framework on two prediction tasks (i.e., crime prediction and housing price prediction) in Chicago. Experiments show that <b>JLPR</b> consistently outperforms state-of-the-art partitioning methods in both tasks, which achieves above 25% and 70% performance improvements in terms of Mean Absolute Error (MAE) for crime prediction and house price prediction tasks, respectively. Additionally, we meticulously undertake three visualization case studies, which yield profound and illuminating findings from diverse perspectives, demonstrating the remarkable effectiveness and superiority of our approach.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140165958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Realizing Efficient On-Device Language-based Image Retrieval
IF 5.1, Quartile 3 (Computer Science)
Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly
{"title":"Realizing Efficient On-Device Language-based Image Retrieval","authors":"Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly","doi":"10.1145/3649896","DOIUrl":"https://doi.org/10.1145/3649896","url":null,"abstract":"<p>Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos in the home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy through learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of having a prohibitively high retrieval latency. Alternatively, there is a new class of methods that exhibits good performance with low latency, but requires a lot more computational resources, and an order of magnitude more training data (i.e. large web-scraped datasets consisting of millions of image-caption pairs) making them infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods are suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces the retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a light-weight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model effectively filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates will be selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage, on standard benchmark datasets show that CrispSearch results in a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution to on-device search and retrieval.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140155127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Agent DRL-based Multipath Scheduling for Video Streaming with QUIC
IF 5.1, Quartile 3 (Computer Science)
Xueqiang Han, Biao Han, Jinrong Li, Congxi Song
{"title":"Multi-Agent DRL-based Multipath Scheduling for Video Streaming with QUIC","authors":"Xueqiang Han, Biao Han, Jinrong Li, Congxi Song","doi":"10.1145/3649139","DOIUrl":"https://doi.org/10.1145/3649139","url":null,"abstract":"<p>The popularization of video streaming brings challenges in satisfying diverse Quality of Service (QoS) requirements. The multipath extension of the Quick UDP Internet Connection (QUIC) protocol, also called MPQUIC, has the potential to improve video streaming performance with multiple simultaneously transmitting paths. The multipath scheduler of MPQUIC determines how to distribute the packets onto different paths. However, while applying current multipath schedulers into MPQUIC, our experimental results show that they fail to adapt to various receive buffer sizes of different devices and comprehensive QoS requirements of video streaming. These problems are especially severe under heterogeneous and dynamic network environments. To tackle these problems, we propose MARS, a <underline>M</underline>ulti-<underline>A</underline>gent deep <underline>R</underline>einforcement learning (MADRL) based Multipath QUIC <underline>S</underline>cheduler, which is able to promptly adapt to dynamic network environments. It exploits the MADRL method to learn a neural network for each path and generate scheduling policy. Besides, it introduces a novel multi-objective reward function that takes out-of-order (OFO) queue size and different QoS metrics into consideration to realize adaptive scheduling optimization. We implement MARS in an MPQUIC prototype and deploy in Dynamic Adaptive Streaming over HTTP (DASH) system. Then we compare it with the state-of-the-art multipath schedulers in both emulated and real-world networks. Experimental results show that MARS outperforms the other schedulers with better adaptive capability regarding the receive buffer sizes and QoS.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Invisible Adversarial Watermarking: A Novel Security Mechanism for Enhancing Copyright Protection
IF 5.1, Quartile 3 (Computer Science)
Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma
{"title":"Invisible Adversarial Watermarking: A Novel Security Mechanism for Enhancing Copyright Protection","authors":"Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma","doi":"10.1145/3652608","DOIUrl":"https://doi.org/10.1145/3652608","url":null,"abstract":"<p>Invisible watermarking can be used as an important tool for copyright certification in the Metaverse. However, with the advent of deep learning, Deep Neural Networks (DNNs) have posed new threats to this technique. For example, artificially trained DNNs can perform unauthorized content analysis and achieve illegal access to protected images. Furthermore, some specially crafted DNNs may even erase invisible watermarks embedded within the protected images, which eventually leads to the collapse of this protection and certification mechanism. To address these issues, inspired by the adversarial attack, we introduce Invisible Adversarial Watermarking (IAW), a novel security mechanism to enhance the copyright protection efficacy of watermarks. Specifically, we design an Adversarial Watermarking Fusion Model (AWFM) to efficiently generate Invisible Adversarial Watermark Images (IAWIs). By modeling the embedding of watermarks and adversarial perturbations as a unified task, the generated IAWIs can effectively defend against unauthorized identification, access, and erase via DNNs, and identify the ownership by extracting the embedded watermark. Experimental results show that the proposed IAW presents superior extraction accuracy, attack ability, and robustness on different DNNs, and the protected images maintain good visual quality, which ensures its effectiveness as an image protection mechanism.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Audio-Visual Contrastive Pre-train for Face Forgery Detection
IF 5.1, Quartile 3 (Computer Science)
Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Ying Guo, Zhen Cheng, Pengfei Yan, Nenghai Yu
{"title":"Audio-Visual Contrastive Pre-train for Face Forgery Detection","authors":"Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Ying Guo, Zhen Cheng, Pengfei Yan, Nenghai Yu","doi":"10.1145/3651311","DOIUrl":"https://doi.org/10.1145/3651311","url":null,"abstract":"<p>The highly realistic avatar in the metaverse may lead to severe leakage of facial privacy. Malicious users can more easily obtain the 3D structure of faces, thus using Deepfake technology to create counterfeit videos with higher realism. To automatically discern facial videos forged with the advancing generation techniques, deepfake detectors need to achieve stronger generalization abilities. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks would provide fundamental features for deepfake detection. We propose a video-level deepfake detection method based on a temporal transformer with a self-supervised audio-visual contrastive learning approach for pre-training the deepfake detector. The proposed method learns motion representations in the mouth region by encouraging the paired video and audio representations to be close while unpaired ones to be diverse. The deepfake detector adopts the pre-trained weights and partially fine-tunes on deepfake datasets. Extensive experiments show that our self-supervised pre-training method can effectively improve the accuracy and robustness of our deepfake detection model without extra human efforts. Compared with existing deepfake detection methods, our proposed method achieves better generalization ability in cross-dataset evaluations.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":null,"pages":null},"PeriodicalIF":5.1,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0