{"title":"Multi-Modal Generative DeepFake Detection via Visual-Language Pretraining with Gate Fusion for Cognitive Computation","authors":"Guisheng Zhang, Mingliang Gao, Qilei Li, Wenzhe Zhai, Gwanggil Jeon","doi":"10.1007/s12559-024-10316-x","DOIUrl":null,"url":null,"abstract":"<p>With the widespread adoption of deep learning, there has been a notable increase in the prevalence of multimodal deepfake content. These deepfakes pose a substantial risk to both individual privacy and the security of their assets. In response to this pressing issue, researchers have undertaken substantial endeavors in utilizing generative AI and cognitive computation to leverage multimodal data to detect deepfakes. However, the efforts thus far have fallen short of fully exploiting the extensive reservoir of multimodal feature information, which leads to a deficiency in leveraging spatial information across multiple dimensions. In this study, we introduce a framework called Visual-Language Pretraining with Gate Fusion (VLP-GF), designed to identify multimodal deceptive content and enhance the accurate localization of manipulated regions within both images and textual annotations. Specifically, we introduce an adaptive fusion module tailored to integrate local and global information simultaneously. This module captures global context and local details concurrently, thereby improving the performance of image bounding-box grounding within the system. Additionally, to maximize the utilization of semantic information from diverse modalities, we incorporate a gating mechanism to strengthen the interaction of multimodal information further. Through a series of ablation experiments and comprehensive comparisons with state-of-the-art approaches on extensive benchmark datasets, we empirically demonstrate the superior efficacy of VLP-GF.</p>","PeriodicalId":51243,"journal":{"name":"Cognitive Computation","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cognitive Computation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s12559-024-10316-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
With the widespread adoption of deep learning, multimodal deepfake content has become increasingly prevalent. These deepfakes pose a substantial risk to both individual privacy and the security of personal assets. In response to this pressing issue, researchers have made substantial efforts to detect deepfakes by applying generative AI and cognitive computation to multimodal data. However, existing approaches fall short of fully exploiting the rich reservoir of multimodal feature information, and consequently fail to leverage spatial information across multiple dimensions. In this study, we introduce Visual-Language Pretraining with Gate Fusion (VLP-GF), a framework designed to identify multimodal deceptive content and accurately localize manipulated regions in both images and textual annotations. Specifically, we introduce an adaptive fusion module that integrates local and global information simultaneously; by capturing global context and local details concurrently, it improves image bounding-box grounding within the system. Additionally, to maximize the use of semantic information from diverse modalities, we incorporate a gating mechanism that further strengthens the interaction of multimodal information. Through a series of ablation experiments and comprehensive comparisons with state-of-the-art approaches on extensive benchmark datasets, we empirically demonstrate the superior efficacy of VLP-GF.
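The abstract names two architectural ideas, an adaptive local-global fusion module and a gating mechanism for multimodal interaction, but this page carries no implementation details. The PyTorch sketch below is only an illustration of what such modules could look like; the class names, feature dimensions, and the sigmoid-gate formulation are assumptions for the sake of the example, not the authors' published code.

```python
# Hypothetical sketch of gated multimodal fusion, in the spirit of the
# abstract's "adaptive fusion module" and "gating mechanism". All names,
# dimensions, and formulations here are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveLocalGlobalFusion(nn.Module):
    """Blend patch-level (local) features with a pooled global context vector."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # Per-token scalar weight deciding how much local vs. global to keep.
        self.alpha = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) local visual features.
        global_ctx = patches.mean(dim=1, keepdim=True).expand_as(patches)
        a = self.alpha(torch.cat([patches, global_ctx], dim=-1))
        return a * patches + (1.0 - a) * global_ctx


class GateFusion(nn.Module):
    """Fuse aligned visual and textual tokens with a learned sigmoid gate."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis, txt: (batch, seq_len, dim); assumes the two streams were
        # already aligned to the same length (e.g., by cross-attention).
        joint = torch.cat([vis, txt], dim=-1)
        g = self.gate(joint)  # channel-wise gate in (0, 1)
        return g * vis + (1.0 - g) * self.proj(joint)


if __name__ == "__main__":
    vis = torch.randn(2, 196, 768)  # e.g., ViT patch tokens
    txt = torch.randn(2, 196, 768)  # text tokens projected to the same width
    fused = GateFusion()(AdaptiveLocalGlobalFusion()(vis), txt)
    print(fused.shape)  # torch.Size([2, 196, 768])
```

In a full system of the kind the abstract describes, the fused tokens would presumably feed both a real/fake classification head and a bounding-box regression head for grounding manipulated image regions.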
Journal Introduction
Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices, and future trends in the emerging discipline of cognitive computation, bridging the gap between the life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.