{"title":"Joint Statistical and Causal Feature Modulated Face Anti-Spoofing","authors":"Xin Dong, Tao Wang, Zhendong Li, Hao Liu","doi":"10.1109/ICME55011.2023.00210","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00210","url":null,"abstract":"In this paper, we propose a hierarchical feature modulation (HFM) approach for stable face anti-spoofing in unseen domains and unseen attacks. The conventional multi-domain based generalizable approaches likely lead to local optima due to the complicated or heuristic learning paradigm. Inspired by the fact that high-level semantic disturbances and low-level miscellaneous bias jointly cause the distribution shift, HFM aims to modulate the fine-grained feature in a hierarchical manner. Specifically, we complement the structural feature with patch-wise learnable statistical information, i.e. local difference histogram, to relieve the overfitting on high-level semantics. We further introduce the structural causal model (SCM) with imaging color model to reveal that presenting mediums and capturing devices destroy the liveness-relevant information from the low level. Thus we model this hidden entanglement as a distribution mixture problem and propose the expectation-maximization (EM) based causal intervention to remove these miscellanies. Experimental results on public datasets demonstrate the effectiveness of HFM, especially in out-of-distribution settings.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129809218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Face Poison: Obstructing DeepFakes by Disrupting Face Detection","authors":"Yuezun Li, Jiaran Zhou, Siwei Lyu","doi":"10.1109/ICME55011.2023.00213","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00213","url":null,"abstract":"Recent years have seen fast development in synthesizing realistic human faces using AI-based forgery technique called DeepFake, which can be weaponized to cause negative personal and social impacts. In this work, we develop a defense method, namely FacePosion, to prevent individuals from becoming victims of DeepFake videos by sabotaging would-be training data. This is achieved by disrupting face detection, a prerequisite step to prepare victim faces for training DeepFake model. Once the training faces are wrongly extracted, the DeepFake model can not be well trained. Specifically, we propose a multi-scale feature-level adversarial attack to disrupt the intermediate features of face detectors using different scales. Extensive experiments are conducted on seven various DeepFake models using six face detection methods, empirically showing that disrupting face detectors using our method can effectively obstruct DeepFakes.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127254986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LC-Beating: An Online System for Beat and Downbeat Tracking using Latency-Controlled Mechanism","authors":"Xinlu Liu, Jiale Qian, Qiqi He, Yi Yu, Wei Li","doi":"10.1109/ICME55011.2023.00192","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00192","url":null,"abstract":"Beat and downbeat tracking is to predict beat and downbeat time steps from a given music piece. Some deep learning models with a dilated structure such as Temporal Convolutional Network (TCN) and Dilated Self-Attention Network (DSAN) have achieved promising performance for this task. However, most of them have to see the whole music context during inference, which limits their deployment to online systems. In this paper, we propose LC-Beating, a novel latency-controlled (LC) mechanism for online beat and downbeat tracking, in which the model only looks ahead a few frames. By appending limited future information, the model can better capture the activity of relevant musical beats, which significantly boosts the performance of online algorithms with limited latency. Moreover, LC-Beating applies a novel real-time implementation of the LC mechanism to TCN and DSAN. The experimental results show that our proposed method outperforms the previous online models by a large margin and is close to the results of the offline models.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129947550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inter-Intra Camera Identity Learning for Person Re-Identification with Training in Single Camera","authors":"Guoqing Zhang, Zhiyuan Luo, Weisi Lin, Xuan Jing","doi":"10.1109/ICME55011.2023.00414","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00414","url":null,"abstract":"Traditional person re-identification (re-ID) methods generally rely on inter-camera person images to smooth the domain disparities between cameras. However, collecting and annotating a large number of inter-camera identities is extremely difficult and time-consuming, and this makes it hard to deploy person re-ID systems in new locations. To tackle this challenge, this paper studies the single-camera-training (SCT) setting where every person in the training set only appears in one camera. In this work, we design a novel inter-intra camera identity learning (I2CIL) framework to effectively address the SCT person re-ID. Specifically, (i) we design a Dual-Branch Identity Learning (DBIL) network consisting of inter-camera and intra-camera learning branches to learn person ID discriminative information. The former learns camera-irrelevant feature representations by constraining the distance of inter-camera negative sample pairs closer than the distance of intra-camera negative sample pairs. The latter focuses on pulling the distance of intra-camera positive sample pairs closer and pushing the distance of intra-camera negative sample pairs further, partially alleviating weak ID discrimination caused by the lack of inter-camera annotations. (ii) We design a Mixed-Sampling Joint Learning (MSJL) strategy, which is capable to capture inter- and intra-camera samples and independently accomplish the inter- and intra-camera learning tasks at the same time, avoiding the mutual interference between the two tasks. Extensive experiments on two public SCT datasets prove the superiority of the proposed approach.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129083436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Episodic Sampling and Momentum Consistency Regularization for Unsupervised Few-shot Learning","authors":"Jiaxin Chen, Yanxu Hu, Meng Shen, A. J. Ma","doi":"10.1109/ICME55011.2023.00491","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00491","url":null,"abstract":"Unsupervised Few-shot Learning (UFSL) is a practical approach to adapting knowledge learned from unlabeled data of base classes to novel classes with limited labeled data. Nevertheless, most existing UFSL methods may not learn generalizable features in latter training epochs due to the simplicity of meta-learning tasks constructed by data augmentation. To address this issue, we propose two novel components, namely Dual Episodic Sampling (DES) and Momentum Consistency Regularization (MCR) for UFSL. In the DES, two types of sampling strategies are used to construct harder training tasks with multiple augmentations to generate each pseudo-class of increased diversity. The MCR constrains the consistency of the backbone encoder with its momentum counterpart to learn better generalized features for novel classes. Experimental results on four datasets verify the superiority of our method for unsupervised few-shot image classification.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122377084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent Feature Regularization based Adversarial Network for Brain Tumor Anomaly Detection","authors":"Nan Wang, Chengwei Chen, Lizhuang Ma, Shaohui Lin","doi":"10.1109/ICME55011.2023.00168","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00168","url":null,"abstract":"Brain tumor anomaly detection plays a critical role in the field of computer-aided diagnosis, which has attracted ever-increasing focus from the medical community However, brain tumor data are scarce and tough to classify. Unsupervised methods enable the reduction of huge labeling costs to be applied to brain tumor anomaly detection during the training only given normal brain images. However, the existing unsupervised methods distinguish whether the input image is abnormal in the image space, which cannot effectively learn the discriminative features. In this paper, we propose a novel brain tumor anomaly detection method via Latent Feature Regularization based Adversarial Network (LFRA-Net), which leverages a latent feature regularizer into adversarial learning to obtain the discriminative features. Comprehensive experiments on BraTS, HCP, MNIST, and CIFAR-10 datasets evaluate the effectiveness of our LFRANet, which outperforms state-of-the-art unsupervised learning methods.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131082369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-domain Prototype Contrastive loss for Few-shot 2D Image-Based 3D Model Retrieval","authors":"Yaqian Zhou, Yu Liu, Dan Song, Jiayu Li, Xuanya Li, Anjin Liu","doi":"10.1109/ICME55011.2023.00492","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00492","url":null,"abstract":"2D image-based 3D model retrieval (IBMR) usually relies on abundant explicit supervision on 2D images, together with unlabeled 3D models to learn domain-aligned yet class-discriminative features for the retrieval task. However, collecting large-scale 2D labels is cost-effective and time-consuming. Therefore, we explore a challenging IBMR task, where only few-shot labeled 2D images are available while the rest of the 2D and 3D samples remain unlabeled. Limited annotation of 2D images further increases the difficulty of domain-aligned yet discriminative feature learning. Therefore, we propose cross-domain prototype contrastive loss (CPCL) for the few-shot IBMR task. Specifically, we capture semantic information to learn class-discriminative features in each domain by minimizing intra-domain prototype contrastive loss. Besides, we perform inter-domain transferable contrastive learning to align the features between instances and prototypes of the same class across domains. Comprehensive experiments on popular benchmarks, MI3DOR and MI3DOR-2, validate the superiority of CPCL.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126958276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-Protected Facial Expression Recognition Augmented by High-Resolution Facial Images","authors":"Cong Liang, Shangfei Wang, Xiaoping Chen","doi":"10.1109/ICME55011.2023.00236","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00236","url":null,"abstract":"Cloud-based expression recognition from high-resolution facial images may put the subjects’ privacy at risk. We identify two kinds of privacy leakage, the appearance leakage in which the visual appearances of subjects are disclosed and the identity-pattern leakage in which the identity information of subjects is dug out. To address both leakages, we propose privacy-protected facial expression recognition from low-resolution facial images with the help of high-resolution facial images. Specifically, to prevent appearance leakage, we propose to extract identity-invariant representations from downsampled images, from which the visually distinguishable appearances cannot be recovered. To prevent identity-pattern leakage, we propose to eliminate the identity information from the extracted representations by leveraging the disentangled representations of high-resolution images as privileged information. After training, our method can fully capture identity-invariant representations from downsampled images for expression recognition without the requirement of high-resolution samples. These privacy-protected representations can be safely transmitted through the Internet. Experimental results in different scenarios demonstrate that the proposed method protects privacy without significantly inhibiting facial expression recognition.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130570953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and Improving Perceptual Quality of Volumetric Video Streaming","authors":"Mengyu Yang, Di Wu, Zelong Wang, Miao Hu, Yipeng Zhou","doi":"10.1109/ICME55011.2023.00339","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00339","url":null,"abstract":"Volumetric video is fully three-dimensional and provides users with highly immersive and interactive experience. However, it is difficult to stream volumetric video over the Internet due to sheer video size and limited network bandwidth. Existing solutions suffered from poor perceptual quality and low coding efficiency. In this paper, we first conduct a comprehensive user study to understand the effectiveness of popular perceptual quality metrics for volumetric video. It is observed that those metrics cannot well capture the impact of user viewing behaviors. Considering the findings that users are more sensitive to the distortion of 2D image rendered from 3D point cloud, a new metric called Volu-FMAF is proposed to better represent perceptual quality of volumetric video. Next, we propose a novel neural-based volumetric video streaming framework RenderVolu and design a distortion-aware rendered image super-resolution network, called RenDA-Net, to further improve user perceptual quality. Last, we conduct extensive experiments with real datasets to validate our proposed method, and the results show that our method can boost the perceptual quality of volumetric video by 171% to 190%, and achieves a speedup of 108x in terms of decoding efficiency compared to the state-of-the-art approaches.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130578953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CHAN: Cross-Modal Hybrid Attention Network for Temporal Language Grounding in Videos","authors":"Wen Wang, Ling Zhong, Guang Gao, Minhong Wan, J. Gu","doi":"10.1109/ICME55011.2023.00259","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00259","url":null,"abstract":"The goal of temporal language grounding (TLG) task is to temporally localize the most semantically matched video segment with respect to a given sentence query in an untrimmed video. How to effectively incorporate the cross-modal interactions between video and language is the key to improve grounding performance. Previous approaches focus on learning correlations by computing the attention matrix between each frame-word pair, while ignoring the global semantics conditioned on one modality for better associating the complex video contents and sentence query of the target modality. In this paper, we propose a novel Cross-modal Hybrid Attention Network, which integrates two parallel attention fusion modules to exploit the semantics of each modality and interactions in cross modalities. One is Intra-Modal Attention Fusion, which utilizes gated self-attention to capture the frame-by-frame and word-by-word relations conditioned on the other modality. The other is Inter-Modal Attention Fusion, which utilizes query and key features derived from different modalities to calculate the co-attention weights and further promote inter-modal fusion. Experimental results show that our CHAN significantly outperforms several existing state-of-the-arts on three challenging datasets (ActivityNet Captions, Charades-STA and TACOS), demonstrating the effectiveness of our proposed method.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132053760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}