{"title":"Deep learning model for simultaneous recognition of quantitative and qualitative emotion using visual and bio-sensing data","authors":"Iman Hosseini , Md Zakir Hossain , Yuhao Zhang , Shafin Rahman","doi":"10.1016/j.cviu.2024.104121","DOIUrl":"10.1016/j.cviu.2024.104121","url":null,"abstract":"<div><p>The recognition of emotions heavily relies on important factors such as human facial expressions and physiological signals, including electroencephalogram and electrocardiogram. In literature, emotion recognition is investigated quantitatively (while estimating valance, arousal, and dominance) and qualitatively (while predicting discrete emotions like happiness, sadness, anger, surprise, and so on). Current methods utilize a combination of visual data and bio-sensing information to create recognition systems that incorporate multiple modes (quantitative/qualitative). Nevertheless, these methods necessitate extensive expertise in specific domains and intricate preprocessing procedures, and consequently, they are unable to fully leverage the inherent advantages of end-to-end deep learning techniques. Moreover, methods usually aim to recognize either qualitative or quantitative emotions. Although both kinds of emotions are significantly co-related, previous methods do not simultaneously recognize qualitative and quantitative emotions. In this paper, a novel deep end-to-end framework named DeepVADNet is introduced, specifically designed for the purpose of multi-modal emotion recognition. The proposed framework leverages deep learning techniques to effectively extract crucial face appearance features as well as bio-sensing features, predicting both qualitative and quantitative emotions in a single forward pass. In this study, we employ the CRNN architecture to extract face appearance features, while the ConvLSTM model is utilized to extract spatio-temporal information from visual data (videos). Additionally, we utilize the Conv1D model for processing physiological signals (EEG, EOG, ECG, and GSR) as this approach deviates from conventional manual techniques that involve traditional manual methods for extracting features based on time and frequency domains. After enhancing the feature quality by fusing both modalities, we use a novel method employing quantitative emotion to predict qualitative emotions accurately. We perform extensive experiments on the DEAP and MAHNOB-HCI datasets, achieving state-of-the-art quantitative emotion recognition results of 98.93%/6e-4 and 89.08%/0.97 (mean classification accuracy/MSE) in both datasets, respectively. Also, for the qualitative emotion recognition task, we achieve 82.71% mean classification accuracy on the MAHNOB-HCI dataset. The code and evaluation can be accessed at: <span><span>https://github.com/I-Man-H/DeepVADNet.git</span><svg><path></path></svg></span></p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104121"},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio–visual deepfake detection using articulatory representation learning","authors":"Yujia Wang, Hua Huang","doi":"10.1016/j.cviu.2024.104133","DOIUrl":"10.1016/j.cviu.2024.104133","url":null,"abstract":"<div><p>Advancements in generative artificial intelligence have made it easier to manipulate auditory and visual elements, highlighting the critical need for robust audio–visual deepfake detection methods. In this paper, we propose an articulatory representation-based audio–visual deepfake detection approach, <em>ART-AVDF</em>. First, we devise an audio encoder to extract articulatory features that capture the physical significance of articulation movement, integrating with a lip encoder to explore audio–visual articulatory correspondences in a self-supervised learning manner. Then, we design a multimodal joint fusion module to further explore inherent audio–visual consistency using the articulatory embeddings. Extensive experiments on the DFDC, FakeAVCeleb, and DefakeAVMiT datasets demonstrate that <em>ART-AVDF</em> obtains a significant performance improvement compared to many deepfake detection models.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104133"},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSTC: Residual Swin Transformer Cascade to approximate Taylor expansion for image denoising","authors":"Jin Liu , Yang Yang , Biyun Xu , Hao Yu , Yaozong Zhang , Qian Li , Zhenghua Huang","doi":"10.1016/j.cviu.2024.104132","DOIUrl":"10.1016/j.cviu.2024.104132","url":null,"abstract":"<div><p>Traditional denoising methods establish mathematical models by employing different priors, which can achieve preferable results but they are usually time-consuming and their outputs are not adaptive on regularization parameters. While the success of end-to-end deep learning denoising strategies depends on a large amount of data and lacks a theoretical interpretability. In order to address the above problems, this paper proposes a novel image denoising method, namely Residual Swin Transformer Cascade (RSTC), based on Taylor expansion. The key procedures of our RSTC are specified as follows: Firstly, we discuss the relationship between image denoising model and Taylor expansion, as well as its adjacent derivative parts. Secondly, we use a lightweight deformable convolutional neural network to estimate the basic layer of Taylor expansion and a residual network where swin transformer block is selected as a backbone for pursuing the solution of the derivative layer. Finally, the results of the two networks contribute to the approximation solution of Taylor expansion. In the experiments, we firstly test and discuss the selection of network parameters to verify its effectiveness. Then, we compare it with existing advanced methods in terms of visualization and quantification, and the results show that our method has a powerful generalization ability and performs better than state-of-the-art denoising methods on performance improvement and structure preservation.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104132"},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep video compression based on Long-range Temporal Context Learning","authors":"Kejun Wu , Zhenxing Li , You Yang, Qiong Liu","doi":"10.1016/j.cviu.2024.104127","DOIUrl":"10.1016/j.cviu.2024.104127","url":null,"abstract":"<div><p>Video compression allows for efficient storage and transmission of data, benefiting imaging and vision applications, e.g. computational imaging, photography, and displays by delivering high-quality videos. To exploit more informative contexts of video, we propose DVCL, a novel <strong>D</strong>eep <strong>V</strong>ideo <strong>C</strong>ompression based on <strong>L</strong>ong-range Temporal Context Learning. Aiming at high coding performance, this new compression paradigm makes full use of long-range temporal correlations derived from multiple reference frames to learn richer contexts. Motion vectors (MVs) are estimated to represent the motion relations of videos. By employing MVs, a long-range temporal context learning (LTCL) module is presented to extract context information from multiple reference frames, such that a more accurate and informative temporal contexts can be learned and constructed. The long-range temporal contexts serve as conditions and generate the predicted frames by contextual encoder and decoder. To address the challenge of imbalanced training, we develop a multi-stage training strategy to ensure the whole DVCL framework is trained progressively and stably. Extensive experiments demonstrate the proposed DVCL achieves the highest objective and subjective quality, while maintaining relatively low complexity. Specifically, 25.30% and 45.75% bitrate savings on average can be obtained than x265 codec at the same PSNR and MS-SSIM, respectively.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104127"},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep unsupervised shadow detection with curriculum learning and self-training","authors":"Qiang Zhang, Hongyuan Guo, Guanghe Li, Tianlu Zhang, Qiang Jiao","doi":"10.1016/j.cviu.2024.104124","DOIUrl":"10.1016/j.cviu.2024.104124","url":null,"abstract":"<div><p>Shadow detection is undergoing a rapid and remarkable development along with the wide use of deep neural networks. Benefiting from a large number of training images annotated with strong pixel-level ground-truth masks, current deep shadow detectors have achieved state-of-the-art performance. However, it is expensive and time-consuming to provide the pixel-level ground-truth mask for each training image. Considering that, this paper proposes the first unsupervised deep shadow detection framework, which consists of an initial pseudo label generation (IPG) module, a curriculum learning (CL) module and a self-training (ST) module. The supervision signals used in our learning framework are generated from several existing traditional unsupervised shadow detectors, which usually contain a lot of noisy information. Therefore, each module in our unsupervised framework is dedicated to reduce the adverse influence of noisy information on model training. Specifically, the IPG module combines different traditional unsupervised shadow maps to obtain their complementary shadow information. After obtaining the initial pseudo labels, the CL module and the ST module will be used in conjunction to gradually learn new shadow patterns and update the qualities of pseudo labels simultaneously. Extensive experimental results on various benchmark datasets demonstrate that our deep shadow detector not only outperforms the traditional unsupervised shadow detection methods by a large margin but also achieves comparable results with some recent state-of-the-art fully-supervised deep shadow detection methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104124"},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for detecting fighting behavior based on key points of human skeletal posture","authors":"Peng Zhang , Xinlei Zhao , Lijia Dong , Weimin Lei , Wei Zhang , Zhaonan Lin","doi":"10.1016/j.cviu.2024.104123","DOIUrl":"10.1016/j.cviu.2024.104123","url":null,"abstract":"<div><p>Detecting fights from videos and images in public surveillance places is an important task to limit violent criminal behavior. Real-time detection of violent behavior can effectively ensure the personal safety of pedestrians and further maintain public social stability. Therefore, in this paper, we aim to detect real-time violent behavior in videos. We propose a novel neural network model framework based on human pose key points, called Real-Time Pose Net (RTPNet). Utilize the pose extractor (YOLO-Pose) to extract human skeleton features, and classify video level violent behavior based on the 2DCNN model (ACTION-Net). Utilize appearance features and inter frame correlation to accurately detect fighting behavior. We have also proposed a new image dataset called VIMD (Violence Image Dataset), which includes images of fighting behavior collected online and captured independently. After training on the dataset, the network can effectively identify skeletal features from videos and locate fighting movements. The dataset is available on GitHub (<span><span>https://github.com/ChinaZhangPeng/Violence-Image-Dataset</span><svg><path></path></svg></span>). We also conducted experiments on four datasets, including Hockey-Fight, RWF-2000, Surveillance Camera Fight, and AVD dataset. These experimental results showed that RTPNet outperformed the most advanced methods in the past, achieving an accuracy of 99.4% on the Hockey-Fight dataset, 93.3% on the RWF-2000 dataset, and 93.4% on the Surveillance Camera Fight dataset, 99.3% on the AVD dataset. And with speeds capable of reaching 33fps, state-of-the-art results are achieved with faster speeds. In addition, RTPNet can also have good detection performance in violent behavior in complex backgrounds.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104123"},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deconfounded hierarchical multi-granularity classification","authors":"Ziyu Zhao, Leilei Gan, Tao Shen, Kun Kuang, Fei Wu","doi":"10.1016/j.cviu.2024.104108","DOIUrl":"10.1016/j.cviu.2024.104108","url":null,"abstract":"<div><p>Hierarchical multi-granularity classification (HMC) assigns labels at varying levels of detail to images using a structured hierarchy that categorizes labels from coarse to fine, such as [“Suliformes”, “Fregatidae”, “Frigatebird”]. Traditional HMC methods typically integrate hierarchical label information into either the model’s architecture or its loss function. However, these approaches often overlook the spurious correlations between coarse-level semantic information and fine-grained labels, which can lead models to rely on these non-causal relationships for making predictions. In this paper, we adopt a causal perspective to address the challenges in HMC, demonstrating how coarse-grained semantics can serve as confounders in fine-grained classification. To comprehensively mitigate confounding bias in HMC, we introduce a novel framework, Deconf-HMC, which consists of three main components: (1) a causal-inspired label prediction module that combines fine-level features with coarse-level prediction outcomes to determine the appropriate labels at each hierarchical level; (2) a representation disentanglement module that minimizes the mutual information between representations of different granularities; and (3) an adversarial training module that restricts the predictive influence of coarse-level representations on fine-level labels, thereby aiming to eliminate confounding bias. Extensive experiments on three widely used datasets demonstrate the superiority of our approach over existing state-of-the-art HMC methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104108"},"PeriodicalIF":4.3,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial attention inference model for cascaded siamese tracking with dynamic residual update strategy","authors":"Huanlong Zhang , Mengdan Liu , Xiaohui Song , Yong Wang , Guanglu Yang , Rui Qi","doi":"10.1016/j.cviu.2024.104125","DOIUrl":"10.1016/j.cviu.2024.104125","url":null,"abstract":"<div><p>Target representation is crucial for visual tracking. Most Siamese-based trackers try their best to establish target models by using various deep networks. However, they neglect the exploration of correlation among features, which leads to the inability to learn more representative features. In this paper, we propose a spatial attention inference model for cascaded Siamese tracking with dynamic residual update strategy. First, a spatial attention inference model is constructed. The model fuses interlayer multi-scale features generated by dilation convolution to enhance the spatial representation ability of features. On this basis, we use self-attention to capture interaction between target and context, and use cross-attention to aggregate interdependencies between target and background. The model infers potential feature information by exploiting the correlations among features for building better appearance models. Second, a cascaded localization-aware network is introduced to bridge a gap between classification and regression. We propose an alignment-aware branch to resample and learn object-aware features from the predicted bounding boxes for obtaining localization confidence, which is used to correct the classification confidence by weighted integration. This cascaded strategy alleviates the misalignment problem between classification and regression. Finally, a dynamic residual update strategy is proposed. This strategy utilizes the Context Fusion Network (CFNet) to fuse the templates of historical and current frames to generate the optimal templates. Meanwhile, we use a dynamic threshold function to determine when to update by judging the tracking results. The strategy uses temporal context to fully explore the intrinsic properties of the target, which enhances the adaptability to changes in the target’s appearance. We conducted extensive experiments on seven tracking benchmarks, including OTB100, UAV123, TC128, VOT2016, VOT2018, GOT10k and LaSOT, to validate the effectiveness of our proposed algorithm.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104125"},"PeriodicalIF":4.3,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarse-to-fine mechanisms mitigate diffusion limitations on image restoration","authors":"Liyan Wang , Qinyu Yang , Cong Wang , Wei Wang , Zhixun Su","doi":"10.1016/j.cviu.2024.104118","DOIUrl":"10.1016/j.cviu.2024.104118","url":null,"abstract":"<div><p>Recent years have witnessed the remarkable performance of diffusion models in various vision tasks. However, for image restoration that aims to recover clear images with sharper details from given degraded observations, diffusion-based methods may fail to recover promising results due to inaccurate noise estimation. Moreover, simple constraining noises cannot effectively learn complex degradation information, which subsequently hinders the model capacity. To solve the above problems, we propose a coarse-to-fine diffusion Transformer (C2F-DFT) to mitigate diffusion limitations mentioned before on image restoration. Specifically, the proposed C2F-DFT contains diffusion self-attention (DFSA) and diffusion feed-forward network (DFN) within a new coarse-to-fine training mechanism. The DFSA and DFN with embedded diffusion steps respectively capture the long-range diffusion dependencies and learn hierarchy diffusion representation to guide the restoration process in different time steps. In the coarse training stage, our C2F-DFT estimates noises and then generates the final clean image by a sampling algorithm. To further improve the restoration quality, we propose a simple yet effective fine training pipeline. It first exploits the coarse-trained diffusion model with fixed steps to generate restoration results, which then would be constrained with corresponding ground-truth ones to optimize the models to remedy the unsatisfactory results affected by inaccurate noise estimation. Extensive experiments show that C2F-DFT significantly outperforms diffusion-based restoration method IR-SDE and achieves competitive performance compared with Transformer-based state-of-the-art methods on 3 tasks, including image deraining, image deblurring, and real image denoising. The source codes and visual results are available at <span><span>https://github.com/wlydlut/C2F-DFT</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104118"},"PeriodicalIF":4.3,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming","authors":"Lin Chen , Jing Zhang , Yian Zhang , Junpeng Kang , Li Zhuo","doi":"10.1016/j.cviu.2024.104109","DOIUrl":"10.1016/j.cviu.2024.104109","url":null,"abstract":"<div><p>Standardized regulation of livestreaming is an important element of cyberspace governance. Temporal action localization (TAL) can localize the occurrence of specific actions to better understand human activities. Due to the short duration and inconspicuous boundaries of human-specific actions, it is very cumbersome to obtain sufficient labeled data for training in untrimmed livestreaming. The point-supervised approach requires only a single-frame annotation for each action instance and can effectively balance cost and performance. Therefore, we propose a memory knowledge propagation network (MKP-Net) for point-supervised temporal action localization in livestreaming, including (1) a plug-and-play memory module is introduced to model prototype features of foreground actions and background knowledge using point-level annotations, (2) the memory knowledge propagation mechanism is used to generate discriminative feature representation in a multi-instance learning pipeline, and (3) localization completeness learning is performed by designing a dual optimization loss for refining and localizing temporal actions. Experimental results show that our method achieves 61.4% and 49.1% SOTAs on THUMOS14 and self-built BJUT-PTAL datasets, respectively, with an inference speed of 711 FPS.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104109"},"PeriodicalIF":4.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}