{"title":"Multi-domain awareness for compressed deepfake videos detection over social networks guided by common mechanisms between artifacts","authors":"","doi":"10.1016/j.cviu.2024.104072","DOIUrl":"10.1016/j.cviu.2024.104072","url":null,"abstract":"<div><p>The viral spread of massive deepfake videos over social networks has caused serious security problems. Despite the remarkable advancements achieved by existing deepfake detection algorithms, deepfake videos over social networks are inevitably influenced by compression factors. This causes deepfake detection performance to be limited by the following challenging issues: (a) interference from compression artifacts, (b) loss of feature information, and (c) aliasing of feature distributions. In this paper, we analyze the common mechanism between compression artifacts and deepfake artifacts, revealing the structural similarity between them and providing a reliable theoretical basis for enhancing the robustness of deepfake detection models against compression. Firstly, based on the common mechanism between artifacts, we design a frequency domain adaptive notch filter to eliminate the interference of compression artifacts on specific frequency bands. Secondly, to reduce the sensitivity of deepfake detection models to unknown noise, we propose a spatial residual denoising strategy. Thirdly, to exploit the intrinsic correlation between feature vectors in the frequency domain branch and the spatial domain branch, we enhance deepfake features using an attention-based feature fusion method. Finally, we adopt a multi-task decision approach to enhance the discriminative power of the latent space representation of deepfakes, achieving deepfake detection with robustness against compression. Extensive experiments show that, compared with the baseline methods, the detection performance of the proposed algorithm on compressed deepfake videos is significantly improved.
In particular, our model is resistant to various types of noise disturbances and can be easily combined with baseline detection models to improve their robustness.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141715999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval","authors":"","doi":"10.1016/j.cviu.2024.104071","DOIUrl":"10.1016/j.cviu.2024.104071","url":null,"abstract":"<div><p>Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking applications, with more structured input data, remains little investigated. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs to image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (<em>e.g.</em> CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. We conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. Finally, we validate the generalization of the approach to other tasks (<em>i.e.</em>, Food Recognition) and domains with structured text, such as the Medical domain on the ROCO dataset.
The code will be made publicly available.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141962288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modality adaptation via feature difference learning for depth human parsing","authors":"","doi":"10.1016/j.cviu.2024.104070","DOIUrl":"10.1016/j.cviu.2024.104070","url":null,"abstract":"<div><p>In the field of human parsing, depth data offers unique advantages over RGB data due to its illumination invariance and geometric detail, which motivates us to explore human parsing with only depth input. However, depth data is challenging to collect at scale due to the specialized equipment required. In contrast, RGB data is readily available in large quantities, presenting an opportunity to enhance depth-only parsing models with semantic knowledge learned from RGB data. However, fully finetuning the RGB-pretrained encoder leads to high training costs and inflexible domain generalization, while keeping the encoder frozen suffers from a large RGB-depth modality gap and restricts the parsing performance. To alleviate the limitations of these naive approaches, we introduce a Modality Adaptation pipeline via Feature Difference Learning (MAFDL) which leverages the RGB knowledge to facilitate depth human parsing. A Difference-Guided Depth Adapter (DGDA) is proposed within MAFDL to learn the feature differences between RGB and depth modalities, adapting depth features into RGB feature space to bridge the modality gap. Furthermore, we also design a Feature Alignment Constraint (FAC) to impose explicit alignment supervision at pixel and batch levels, making the modality adaptation more comprehensive. 
Extensive experiments on the NTURGBD-Parsing-4K dataset show that our method surpasses previous state-of-the-art approaches.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141637936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implicit and explicit commonsense for multi-sentence video captioning","authors":"","doi":"10.1016/j.cviu.2024.104064","DOIUrl":"10.1016/j.cviu.2024.104064","url":null,"abstract":"<div><p>Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of its performance. We formalize the task using the ALFRED dataset generated using the AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially-aware and causal sentence structure.
We show that our commonsense knowledge enhanced approach produces significant improvements on this task (up to 57% in METEOR and 8.5% in CIDEr), as well as the state-of-the-art result on more traditional video captioning in the ActivityNet Captions dataset.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001450/pdfft?md5=10aaeba9fc35361fc930cce1f7a01fe8&pid=1-s2.0-S1077314224001450-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141732463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classroom teacher action recognition based on spatio-temporal dual-branch feature fusion","authors":"Di Wu , Jun Wang , Wei Zou , Shaodong Zou , Juxiang Zhou , Jianhou Gan","doi":"10.1016/j.cviu.2024.104068","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104068","url":null,"abstract":"<div><p>The classroom teaching action recognition task refers to recognizing and understanding teacher action through video temporal and spatial information. Due to complex backgrounds and significant occlusions, recognizing teacher action in the classroom environment poses substantial challenges. In this study, we propose a classroom teacher action recognition approach based on a spatio-temporal dual-branch feature fusion architecture, where the core task involves utilizing continuous human keypoint heatmap information and single-frame image information. Specifically, we fuse features from two modalities to propose a method combining image spatial information with temporal human keypoint heatmap information for teacher action recognition. Our approach ensures recognition accuracy while reducing the model’s parameters and computational complexity. Additionally, we constructed a teacher action dataset (CTA) in a real classroom environment, comprising 12 action categories, 13k+ video segments, and a total duration exceeding 15 h. The experimental results on the CTA dataset validate the effectiveness of our proposed method. 
Our research explores action recognition tasks in real complex classroom environments, providing a technical framework for classroom teaching intelligent analysis.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141596268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced dual contrast representation learning with cell separation and merging for breast cancer diagnosis","authors":"","doi":"10.1016/j.cviu.2024.104065","DOIUrl":"10.1016/j.cviu.2024.104065","url":null,"abstract":"<div><p>Breast cancer remains a prevalent malignancy impacting a substantial number of individuals globally. In recent times, there has been a growing trend of combining deep learning methods with breast cancer diagnosis. Nevertheless, this integration encounters challenges, including limited data availability, class imbalance, and the absence of fine-grained labels to safeguard patient privacy and accommodate experience-dependent detection. To address these issues, we propose an effective framework based on dual contrast representation learning with a cell separation and merging strategy. The proposed algorithm comprises three main components: the cell separation and merging part, the dual contrast representation learning part, and the multi-category classification part. The cell separation and merging part takes an unpaired set of histopathological images as input and produces two sets of separated image layers through the exploration of latent semantic information using SAM. Subsequently, these separated image layers are utilized to generate two new unpaired histopathological images via a cell separation and merging approach based on the linear superimposition model, with an inpainting network employed to refine image details. Thus, the class imbalance problem is alleviated and the data size is enlarged for sufficient CNN training. The second part introduces a dual contrast representation learning framework for these generated images, with one branch designed for the positive samples (tumor cells) and the other for the negative samples (normal cells).
The contrast learning network effectively minimizes the distance between two generated positive samples while maximizing the similarity of intra-class images to enhance feature representation. Leveraging the feature representation acquired from the dual contrast representation learning part, a pre-trained classifier is further fine-tuned to predict breast cancer categories. Extensive quantitative and qualitative experimental results validate the superiority of our proposed method compared to other state-of-the-art methods on the BreaKHis dataset in terms of four measurement metrics.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141700981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-label image classification using adaptive graph convolutional networks: From a single domain to multiple domains","authors":"Inder Pal Singh , Enjie Ghorbel , Oyebade Oyedotun , Djamila Aouada","doi":"10.1016/j.cviu.2024.104062","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104062","url":null,"abstract":"<div><p>This paper proposes an adaptive graph-based approach for multi-label image classification. Graph-based methods have been largely exploited in the field of multi-label classification, given their ability to model label correlations. Specifically, their effectiveness has been proven not only when considering a single domain but also when taking into account multiple domains. However, the topology of the graph used is not optimal, as it is pre-defined heuristically. In addition, consecutive Graph Convolutional Network (GCN) aggregations tend to destroy the feature similarity. To overcome these issues, an architecture for learning the graph connectivity in an end-to-end fashion is introduced. This is done by integrating an attention-based mechanism and a similarity-preserving strategy. The proposed framework is then extended to multiple domains using an adversarial training scheme. Numerous experiments are reported on well-known single-domain and multi-domain benchmarks. The results demonstrate that our approach achieves competitive results in terms of mean Average Precision (mAP) and model size compared to the state-of-the-art.
The code will be made publicly available.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001437/pdfft?md5=0c261f58e8fe19e830f04f80492395f1&pid=1-s2.0-S1077314224001437-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pseudo initialization based Few-Shot Class Incremental Learning","authors":"Mingwen Shao , Xinkai Zhuang , Lixu Zhang , Wangmeng Zuo","doi":"10.1016/j.cviu.2024.104067","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104067","url":null,"abstract":"<div><p>Few-Shot Class Incremental Learning (FSCIL) aims to recognize sequentially arriving new classes without catastrophically forgetting old classes. The incremental new classes contain only very few labeled examples for updating the model, which causes an overfitting problem. The popular embedding-space-reserving method Forward Compatible Training preserves feature space for novel classes: each base class is pushed away from its most similar virtual class in preparation for the incoming novel classes. However, this can also push the base class toward other similar virtual classes. In this paper, we propose a novel FSCIL method to overcome this problem. Specifically, our core idea is to push base classes away from the most similar top-K virtual classes to reserve feature space and provide pseudo initialization for the incoming novel classes. To further encourage learning new classes without forgetting, an additional regularization is applied to limit the extent of model updating. Extensive experiments are conducted on CUB200, CIFAR100 and mini-ImageNet, illustrating the performance of our proposed method.
The results show that our method outperforms the state-of-the-art method and achieves significant improvement.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141596270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camouflaged object segmentation with prior via two-stage training","authors":"Rui Wang , Caijuan Shi , Changyu Duan , Weixiang Gao , Hongli Zhu , Yunchao Wei , Meiqin Liu","doi":"10.1016/j.cviu.2024.104061","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104061","url":null,"abstract":"<div><p>The camouflaged object segmentation (COS) task aims to segment objects visually embedded within the background. Existing models usually rely on prior information as an auxiliary means to identify camouflaged objects. However, low-quality priors and a singular form of guidance hinder the effective utilization of prior information. To address these issues, we propose a novel approach for prior generation and guidance, named prior-guided transformer (PGT). For prior generation, we design a prior generation subnetwork consisting of a Transformer backbone and simple convolutions to obtain higher-quality priors at a lower cost. In addition, to fully exploit the backbone’s understanding of the camouflage characteristics, a novel two-stage training method is proposed to achieve deep supervision of the backbone. For prior guidance, we design a prior guidance module (PGM) with distinct spatial token mixers to respectively capture the global dependencies of location priors and the local details of boundary priors. Additionally, we introduce a cross-level prior in the form of features to facilitate inter-level communication of backbone features. Extensive experiments have been conducted, and the results illustrate the effectiveness and superiority of our method.
The code is available at <span>https://github.com/Ray3417/PGT</span><svg><path></path></svg>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141540520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A proxy-data-based hierarchical adversarial patch generation method","authors":"Jiawei Liu , Xun Gong , Tingting Wang , Yunfeng Hu , Hong Chen","doi":"10.1016/j.cviu.2024.104066","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104066","url":null,"abstract":"<div><p>Current <em>training data-dependent</em> physical attacks have limited applicability to privacy-critical situations when attackers lack access to neural networks’ training data. To address this issue, this paper presents a hierarchical adversarial patch generation framework considering data privacy, utilizing <em>proxy datasets</em> while assuming that the training data is blinded. In the upper layer, <strong>Average Patch Saliency</strong> (<strong>APS</strong>) is introduced as a quantitative metric to determine the best proxy dataset for patch generation from a set of publicly available datasets. In the lower layer, <strong>Expectation of Transformation Plus</strong> (<strong>EoT+</strong>) method is developed to generate patches while accounting for perturbing background simulation and sensitivity alleviation. Evaluation results obtained in digital settings show that the proposed proxy-data-based framework achieves comparable targeted attack results to the data-dependent benchmark method. 
Finally, the framework’s validity is comprehensively evaluated in the physical world, where the corresponding experimental videos and code can be found <span>here</span><svg><path></path></svg>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141480368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}