{"title":"Synthetic Feature Assessment for Zero-Shot Object Detection","authors":"Xinmiao Dai, Chong Wang, Haohe Li, Sunqi Lin, Lining Dong, Jiafei Wu, Jun Wang","doi":"10.1109/ICME55011.2023.00083","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00083","url":null,"abstract":"Zero-shot object detection aims to simultaneously identify and localize classes that were not presented during training. Many generative model-based methods have shown promising performance by synthesizing the visual features of unseen classes from semantic embeddings. However, these synthetic features are inevitably of varied quality, which may be far from the ground truth. It degrades the performance of trained unseen classifier. Instead of tweaking the generative model, a new idea of feature quality assessment is proposed to utilize both the good and bad features to optimize the classifier in the right direction. Moreover, contrastive learning is also introduced to enhance the feature uniqueness between unseen and seen classes, which helps the feature assessment implicitly. To demonstrate the effectiveness of the proposed algorithm, comprehensive experiments are conducted on the MS COCO dataset and PASCAL VOC dataset, the state-of-the-art performance is achieved. Our code is available at: https://github.com/Dai1029/SFA-ZSD.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131723592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-View Co-Learning Method for Multimodal Sentiment Analysis","authors":"Wenxiu Geng, Yulong Bian, Xiangxian Li","doi":"10.1109/ICME55011.2023.00238","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00238","url":null,"abstract":"Existing works on multimodal sentiment analysis have focused on learning more discriminative unimodal sentiment information or improving multimodal fusion methods to enhance modal complementarity. However, practical results of these methods have been limited owing to the problems of insufficient intra-modal representation and inter-modal noise. To alleviate this problem, we propose a multi-view co-learning method (MVATF) for video sentiment analysis. First, we propose a multi-view features extraction module to capture more perspectives from a single modality. Second, we propose a two-level fusion sentiment enhancement strategy that uses hierarchical attentive learning fusion and a multi-task learning fusion module to achieve co-learning to effectively filter inter-modal noise for better multimodal sentiment fusion features. Experimental results on the CH-SIMS, CMU-MOSI and MOSEI datasets show that the proposed method outperforms the state-of-the-art methods.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131758782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain-Invariant Feature Learning for General Face Forgery Detection","authors":"Jian Zhang, J. Ni","doi":"10.1109/ICME55011.2023.00396","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00396","url":null,"abstract":"Though existing methods for face forgery detection achieve fairly good performance under the intra-dataset scenario, few of them gain satisfying results in the case of cross-dataset testing with more practical value. To tackle this issue, in this paper, we propose a novel domain-invariant feature learning framework - DIFL for face forgery detection. In the framework, an adversarial domain generalization is introduced to learn the domain-invariant features from the forged samples synthesized by various algorithms. Then a center loss in fractional form (CL) is utilized to learn more discriminative features by aggregating the real faces while separating the fake faces from the real ones in the embedding space. In addition, a global and local random crop augmentation strategy is utilized to generate more data views of forged facial images at various scales. Extensive experimental results demonstrate the effectiveness and generalization of the proposed method compared with other state-of-the-art methods.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130882029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fixing Domain Bias for Generalized Deepfake Detection","authors":"Yuzhe Mao, Weike You, Linna Zhou, Zhigao Lu","doi":"10.1109/ICME55011.2023.00380","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00380","url":null,"abstract":"Generalizing deepfake detection has posed a great challenge to digital media forensics, as inferior performance is obtained when training sets and testing sets are domain-mismatched. In this paper, we show that a CNN-based detection model can significantly improve performance by fixing domain bias. Specifically, we propose a novel Fixing Domain Bias network (FDBN). FDBN does not rely on manual features, but is based on three core designs. Firstly, a domain-invariant network based on randomly stylized normalization is devised to constrain the domain discrepancy in the feature space. Then, through adversarial learning, a generalizing representation in the stylized distribution is learned to enhance the shared feature bias among manipulation methods in the domain-specific network. Finally, to encourage equality of biases among different domains, we utilize the bias extrapolation penalty strategy by suppressing the expected bias on the extremely-performing domains. Extensive experiments demonstrate that our framework achieves effectiveness and generalization towards unseen face forgeries.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131010763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is Really Correlation Information Represented Well in Self-Attention for Skeleton-based Action Recognition?","authors":"Wentian Xin, Hongkai Lin, Ruyi Liu, Yi Liu, Q. Miao","doi":"10.1109/ICME55011.2023.00139","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00139","url":null,"abstract":"Transformer has shown significant advantages by various vision tasks. However, the lack of representation of correlation information about data properties makes it difficult to match the excellent results consistent with GCNs in skeleton-based action recognition. In this paper, we propose a Topology and Frames-guided Spatial-Temporal ConvFormer Network (TF-STCFormer), which is well suited for dynamically extracting topological and inter-frame uniqueness & co-occurrence information. Three essential components make up the proposed framework: (1) Grouped Physical-guided Spatial Transformer for focusing on learning essential spatial features and physical topology. (2) Global and Focal Temporal Transformer for promoting the relationship of different joints in consecutive frames and improving the representation of discriminative key-frames. (3) Grouped Dilation Temporal Convolution for connecting the intermediate output obtained by the previous transformers in the feature channels of different dilation. Experiments on four standard datasets (NTU RGB+D, NTU RGB+D 120, NW-UCLA, and UAV-Human) demonstrate that our approach prominently outperforms state-of-the-art methods on all benchmarks.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"120 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133686514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Peer Upsampled Transform Domain Prediction for G-PCC","authors":"Wenyi Wang, Yingzhan Xu, Kai Zhang, Li Zhang","doi":"10.1109/ICME55011.2023.00127","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00127","url":null,"abstract":"To meet the growing demand for point cloud compression, MPEG is developing a point cloud compression standard called as G-PCC. In G-PCC, upsampled transform domain prediction (UTDP) is used to improve attribute coding performance. However, only the attributes in the previous level can be used to predict the attributes of transform sub-blocks in UTDP, which limits the efficiency of UTDP. To address this limitation, we propose a method called peer-UTDP to improve UTDP by using peer neighbors in this paper. With peer-UTDP, attributes of co-plane or co-line peer neighbors in the level same as that of the transform sub-block can be used as prediction in the upsampling process. Experimental results show that our method outperforms G-PCC with an average coding gain of -5.1%, -5.4%, -5.1% and -1.4% under C1 condition, and -5.1%, -5.6%, -5.6% and -1.7% under C2 condition for Y, Cb, Cr and reflectance, respectively. The proposed peer-UTDP has been adopted by G-PCC.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132726039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microimage-based Two-step Search For Plenoptic 2.0 Video Coding","authors":"Yuqing Yang, Xin Jin, Kedeng Tong, Chen Wang, Haitian Huang","doi":"10.1109/ICME55011.2023.00437","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00437","url":null,"abstract":"The plenoptic 2.0 video can record a time-varying dense light field, which benefits many immersive visual applications such as AR/VR. However, traditional inter motion estimation methods perform inefficiently in such kinds of video sequences due to the distinctive temporal characteristics caused by the imaging principle. In this paper, a microimage-based two- step search (MTSS) is proposed to achieve a better trade-off between coding performance and coding complexity. Based on microimage focus variation analysis in imaging dynamic scenes, a microlens-diameter and matching-distance spatial search with local refinement is proposed to exploit the image correlations among the microimage and to compensate the defocused inaccuracy. Implementing the proposed motion estimation in H.266 platform VTM-11.0 and comparing with the state-of-the-art methods, obvious compression efficiency improvements are achieved with limited complexity increment, which benefits the standardization of plenoptic video coding.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131444200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Content-based Viewport Prediction Framework for 360° Video Using Personalized Federated Learning and Fusion Techniques","authors":"Mehdi Setayesh, V. Wong","doi":"10.1109/ICME55011.2023.00118","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00118","url":null,"abstract":"Viewport prediction is a key enabler for 360° video streaming over wireless networks. To improve the prediction accuracy, a common approach is to use a content-based viewport prediction model. Saliency detection based on traditional convolutional neural networks (CNNs) suffers from distortion due to equirectangular projection. Also, the viewers may have their own viewing behavior and are not willing to share their historical head movement with others. To address the aforementioned issues, in this paper, we first develop a saliency detection model using a spherical CNN (SPCNN). Then, we train the viewers’ head movement prediction model using personalized federated learning (PFL). Finally, we propose a content-based viewport prediction framework by integrating the video saliency map and the head orientation map of each viewer using fusion techniques. The experimental results show that our proposed framework provides higher average accuracy and precision when compared with three state-of-the-art algorithms from the literature.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127858809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Feature-Guided Stereoscopic Video Quality Assessment Based on Transformer and Convolutional Neural Network","authors":"Yuan Chen, Sumei Li","doi":"10.1109/ICME55011.2023.00428","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00428","url":null,"abstract":"Stereoscopic video (3D video) has been increasingly applied in industry and entertainment. And the research of stereoscopic video quality assessment (SVQA) has become very important for promoting the development of stereoscopic video system. Many CNN-based models have emerged for SVQA task. However, these methods ignore the significance of the global information of the video frames for quality perception. In this paper, we propose a multi-level feature-fusion model based on Transformer and convolutional neural network (MFFTCNet) to assess the perceptual quality of the stereoscopic video. Firstly, we use global information from Transformer to guide local information from convolutional neural network (CNN). Moreover, we utilize low-level features in the CNN branch to guide high-level features. Besides, considering the binocular rivalry effect in the human vision system (HVS), we use 3D convolution to achieve rivalry fusion of binocular features. The proposed method is tested on two public stereoscopic video quality datasets. The result shows that this method correlates highly with human visual perception and outperforms state-of-the-art (SOTA) methods by a significant margin.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127521844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hidden Follower Detection via Refined Gaze and Walking State Estimation","authors":"Yaxi Chen, Ruimin Hu, Danni Xu, Zheng Wang, Linbo Luo, Dengshi Li","doi":"10.1109/ICME55011.2023.00356","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00356","url":null,"abstract":"Hidden following is following behavior with special intentions, and detecting hidden following behavior can prevent many criminal activities in advance. The previous method uses gaze and spacing behaviors to distinguish hidden followers from normal pedestrians. However, they express gaze behaviors in a coarse-grained way with binary values, making it difficult to accurately depict the gaze state of pedestrians. To this end, we propose the Refined Hidden Follower Detection (RHFD) model by choosing a suitable mapping function based on the principle that the closer the gaze direction is to someone, the more likely it is to gaze at someone, which converts the gaze direction into a continuous estimated gaze state representing the complex and variable gaze behavior of pedestrians. Simultaneously, we introduce variations in the magnitude and direction of pedestrian velocity to refine the representation of pedestrian walking states. Experimental results on the surveillance dataset show that RHFD outperforms state-of-the-art methods.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124115706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}