{"title":"Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample","authors":"Zhiwen Shao, Hancheng Zhu, Yong Zhou, Xiang Xiang, Bing Liu, Rui Yao, Lizhuang Ma","doi":"10.1007/s11263-024-02258-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02258-6","url":null,"abstract":"<p>Facial action unit (AU) detection remains a challenging task, due to the subtlety, dynamics, and diversity of AUs. Recently, the prevailing techniques of self-attention and causal inference have been introduced to AU detection. However, most existing methods directly learn self-attention guided by AU detection, or employ common patterns for all AUs during causal intervention. The former often captures irrelevant information in a global range, and the latter ignores the specific causal characteristic of each AU. In this paper, we propose a novel AU detection framework called <span>(textrm{AC}^{2})</span>D by adaptively constraining self-attention weight distribution and causally deconfounding the sample confounder. Specifically, we explore the mechanism of self-attention weight distribution, in which the self-attention weight distribution of each AU is regarded as spatial distribution and is adaptively learned under the constraint of location-predefined attention and the guidance of AU detection. Moreover, we propose a causal intervention module for each AU, in which the bias caused by training samples and the interference from irrelevant AUs are both suppressed. Extensive experiments show that our method achieves competitive performance compared to state-of-the-art AU detection approaches on challenging benchmarks, including BP4D, DISFA, GFT, and BP4D+ in constrained scenarios and Aff-Wild2 in unconstrained scenarios.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"232 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142448787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Data-Centric Face Anti-spoofing: Improving Cross-Domain Generalization via Physics-Based Data Synthesis","authors":"Rizhao Cai, Cecelia Soh, Zitong Yu, Haoliang Li, Wenhan Yang, Alex C. Kot","doi":"10.1007/s11263-024-02240-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02240-2","url":null,"abstract":"<p>Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, etc. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at https://github.com/RizhaoCai/FAS-Aug.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"9 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142448786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Multimodal Quality Assessment of Low-Light Images","authors":"Miaohui Wang, Zhuowei Xu, Mai Xu, Weisi Lin","doi":"10.1007/s11263-024-02239-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02239-9","url":null,"abstract":"<p>Blind image quality assessment (BIQA) aims at automatically and accurately forecasting objective scores for visual signals, which has been widely used to monitor product and service quality in low-light applications, covering smartphone photography, video surveillance, autonomous driving, etc. Recent developments in this field are dominated by unimodal solutions inconsistent with human subjective rating patterns, where human visual perception is simultaneously reflected by multiple sensory information. In this article, we present a unique blind multimodal quality assessment (BMQA) of low-light images from subjective evaluation to objective score. To investigate the multimodal mechanism, we first establish a multimodal low-light image quality (MLIQ) database with authentic low-light distortions, containing image-text modality pairs. Further, we specially design the key modules of BMQA, considering multimodal quality representation, latent feature alignment and fusion, and hybrid self-supervised and supervised learning. Extensive experiments show that our BMQA yields state-of-the-art accuracy on the proposed MLIQ benchmark database. In particular, we also build an independent single-image modality Dark-4K database, which is used to verify its applicability and generalization performance in mainstream unimodal applications. Qualitative and quantitative results on Dark-4K show that BMQA achieves superior performance to existing BIQA approaches as long as a pre-trained model is provided to generate text descriptions. The proposed framework and two databases as well as the collected BIQA methods and evaluation metrics are made publicly available on https://charwill.github.io/bmqa.html.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"1 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142443819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio-Visual Segmentation with Semantics","authors":"Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong","doi":"10.1007/s11263-024-02261-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02261-x","url":null,"abstract":"<p>We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, <i>i.e.</i>, AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources, and 3) fully-supervised audio-visual semantic segmentation. The first two settings need to generate binary masks of sounding objects indicating pixels corresponding to the audio, while the third setting further requires to generate semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"7 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142440236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Accurate Low-bit Quantization towards Efficient Computational Imaging","authors":"Sheng Xu, Yanjing Li, Chuanjian Liu, Baochang Zhang","doi":"10.1007/s11263-024-02250-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02250-0","url":null,"abstract":"<p>Recent advances of deep neural networks (DNNs) promote low-level vision applications in real-world scenarios, <i>e.g.</i>, image enhancement, dehazing. Nevertheless, DNN-based methods encounter challenges in terms of high computational and memory requirements, especially when deployed on real-world devices with limited resources. Quantization is one of effective compression techniques that significantly reduces computational and memory requirements by employing low-bit parameters and bit-wise operations. However, low-bit quantization for computational imaging (<b>Q-Imaging</b>) remains largely unexplored and usually suffer from a significant performance drop compared with the real-valued counterparts. In this work, through empirical analysis, we identify the main factor responsible for such significant performance drop underlies in the large gradient estimation error from non-differentiable weight quantization methods, and the activation information degeneration along with the activation quantization. To address these issues, we introduce a differentiable quantization search (DQS) method to learn the quantized weights and an information boosting module (IBM) for network activation quantization. Our DQS method allows us to treat the discrete weights in a quantized neural network as variables that can be searched. We achieve this end by using a differential approach to accurately search for these weights. In specific, each weight is represented as a probability distribution across a set of discrete values. During training, these probabilities are optimized, and the values with the highest probabilities are chosen to construct the desired quantized network. Moreover, our IBM module can rectify the activation distribution before quantization to maximize the self-information entropy, which retains the maximum information during the quantization process. Extensive experiments across a range of image processing tasks, including enhancement, super-resolution, denoising and dehazing, validate the effectiveness of our Q-Imaging along with superior performances compared to a variety of state-of-the-art quantization methods. In particular, the method in Q-Imaging also achieves a strong generalization performance when composing a detection network for the dark object detection task.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"69 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142431388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Ultra High-Speed Hyperspectral Imaging by Integrating Compressive and Neuromorphic Sampling","authors":"Mengyue Geng, Lizhi Wang, Lin Zhu, Wei Zhang, Ruiqin Xiong, Yonghong Tian","doi":"10.1007/s11263-024-02236-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02236-y","url":null,"abstract":"<p>Hyperspectral and high-speed imaging are both important for scene representation and understanding. However, simultaneously capturing both hyperspectral and high-speed data is still under-explored. In this work, we propose a high-speed hyperspectral imaging system by integrating compressive sensing sampling with bioinspired neuromorphic sampling. Our system includes a coded aperture snapshot spectral imager capturing moderate-speed hyperspectral measurement frames and a spike camera capturing high-speed grayscale dense spike streams. The two cameras provide complementary dual-modality data for reconstructing high-speed hyperspectral videos (HSV). To effectively synergize the two sampling mechanisms and obtain high-quality HSV, we propose a unified multi-modal reconstruction framework. The framework consists of a Spike Spectral Prior Network for spike-based information extraction and prior regularization, coupled with a dual-modality iterative optimization algorithm for reliable reconstruction. We finally build a hardware prototype to verify the effectiveness of our system and algorithm design. Experiments on both simulated and real data demonstrate the superiority of the proposed approach, where for the first time to our knowledge, high-speed HSV with 30 spectral bands can be captured at a frame rate of up to 20,000 FPS.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"10 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142431329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"4Seasons: Benchmarking Visual SLAM and Long-Term Localization for Autonomous Driving in Challenging Conditions","authors":"Patrick Wenzel, Nan Yang, Rui Wang, Niclas Zeller, Daniel Cremers","doi":"10.1007/s11263-024-02230-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02230-4","url":null,"abstract":"<p>In this paper, we present a novel visual SLAM and long-term localization benchmark for autonomous driving in challenging conditions based on the large-scale 4Seasons dataset. The proposed benchmark provides drastic appearance variations caused by seasonal changes and diverse weather and illumination conditions. While significant progress has been made in advancing visual SLAM on small-scale datasets with similar conditions, there is still a lack of unified benchmarks representative of real-world scenarios for autonomous driving. We introduce a new unified benchmark for jointly evaluating visual odometry, global place recognition, and map-based visual localization performance which is crucial to successfully enable autonomous driving in any condition. The data has been collected for more than one year, resulting in more than 300 km of recordings in nine different environments ranging from a multi-level parking garage to urban (including tunnels) to countryside and highway. We provide globally consistent reference poses with up to centimeter-level accuracy obtained from the fusion of direct stereo-inertial odometry with RTK GNSS. We evaluate the performance of several state-of-the-art visual odometry and visual localization baseline approaches on the benchmark and analyze their properties. The experimental results provide new insights into current approaches and show promising potential for future research. Our benchmark and evaluation protocols will be available at https://go.vision.in.tum.de/4seasons.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142431462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edge-Oriented Adversarial Attack for Deep Gait Recognition","authors":"Saihui Hou, Zengbin Wang, Man Zhang, Chunshui Cao, Xu Liu, Yongzhen Huang","doi":"10.1007/s11263-024-02225-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02225-1","url":null,"abstract":"<p>Gait recognition is a non-intrusive method that captures unique walking patterns without subject cooperation, which has emerged as a promising technique across various fields. Recent studies based on Deep Neural Networks (DNNs) have notably improved the performance, however, the potential vulnerability inherent in DNNs and their resistance to interference in practical gait recognition systems remain under-explored. To fill the gap, in this paper, we focus on imperceptible adversarial attack for deep gait recognition and propose an edge-oriented attack strategy tailored for silhouette-based approaches. Specifically, we make a pioneering attempt to explore the intrinsic characteristics of binary silhouettes, with a primary focus on injecting noise perturbations into the edge area. This simple yet effective solution enables sparse attack in both the spatial and temporal dimensions, which largely ensures imperceptibility and simultaneously achieves high success rate. In particular, our solution is built on a unified framework, allowing seamless switching between untargeted and targeted attack modes. Extensive experiments conducted on in-the-lab and in-the-wild benchmarks validate the effectiveness of our attack strategy and emphasize the necessity to study adversarial attack and defense strategy in the near future.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"54 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142405348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DLRA-Net: Deep Local Residual Attention Network with Contextual Refinement for Spectral Super-Resolution","authors":"Ahmed R. El-gabri, Hussein A. Aly, Tarek S. Ghoniemy, Mohamed A. Elshafey","doi":"10.1007/s11263-024-02238-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02238-w","url":null,"abstract":"<p>Hyperspectral Images (HSIs) provide detailed scene insights using extensive spectral bands, crucial for material discrimination and earth observation with substantial costs and low spatial resolution. Recently, Convolutional Neural Networks (CNNs) are common choice for Spectral Super-Resolution (SSR) from Multispectral Images (MSIs). However, they often fail to simultaneously exploit pixel-level noise degradation of MSIs and complex contextual spatial-spectral characteristics of HSIs. In this paper, a Deep Local Residual Attention Network with Contextual Refinement Network (DLRA-Net) is proposed to integrate local low-rank spectral and global contextual priors for improved SSR. Specifically, SSR is unfolded into Contextual-attention Refinement Module (CRM) and Dual Local Residual Attention Module (DLRAM). CRM is proposed to adaptively learn complex contextual priors to guide the convolution layer weights for improved spatial restorations. While DLRAM captures deep refined texture details to enhance contextual priors representations for recovering HSIs. Moreover, lateral fusion strategy is designed to integrate the obtained priors among DLRAMs for faster network convergence. Experimental results on natural-scene datasets with practical noise patterns confirm exceptional DLRA-Net performance with relatively small model size. DLRA-Net demonstrates Maximum Relative Improvements (MRI) between 9.71 and 58.58% in Mean Relative Absolute Error (MRAE) with reduced parameters between 52.18 and 85.85%. Besides, a practical RS-HSI dataset is generated for evaluations showing MRI between 8.64 and 50.56% in MRAE. Furthermore, experiments with HSI classifiers indicate improved performance of reconstructed RS-HSIs compared to RS-MSIs, with MRI in Overall Accuracy (OA) between 7.10 and 15.27%. Lastly, a detailed ablation study assesses model complexity and runtime.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"43 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142397921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Generalized Multi-timescale Inconsistency for Detecting Deepfake Videos","authors":"Yang Yu, Rongrong Ni, Siyuan Yang, Yu Ni, Yao Zhao, Alex C. Kot","doi":"10.1007/s11263-024-02249-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02249-7","url":null,"abstract":"<p>Recent advancements in face forgery techniques have continuously evolved, leading to emergent security concerns in society. Existing detection methods have poor generalization ability due to the insufficient extraction of dynamic inconsistency cues on the one hand, and their inability to deal well with the gaps between forgery techniques on the other hand. To develop a new generalized framework that emphasizes extracting generalizable multi-timescale inconsistency cues. Firstly, we capture subtle dynamic inconsistency via magnifying the multipath dynamic inconsistency from the local-consecutive short-term temporal view. Secondly, the inter-group graph learning is conducted to establish the sufficient-interactive long-term temporal view for capturing dynamic inconsistency comprehensively. Finally, we design the domain alignment module to directly reduce the distribution gaps via simultaneously disarranging inter- and intra-domain feature distributions for obtaining a more generalized framework. Extensive experiments on six large-scale datasets and the designed generalization evaluation protocols show that our framework outperforms state-of-the-art deepfake video detection methods.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"99 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142398163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}