{"title":"Invisible Intruders: Label-Consistent Backdoor Attack Using Re-Parameterized Noise Trigger","authors":"Bo Wang;Fei Yu;Fei Wei;Yi Li;Wei Wang","doi":"10.1109/TMM.2024.3412388","DOIUrl":"10.1109/TMM.2024.3412388","url":null,"abstract":"Aremarkable number of backdoor attack methods have been proposed in the literature on deep neural networks (DNNs). However, it hasn't been sufficiently addressed in the existing methods of achieving true senseless backdoor attacks that are visually invisible and label-consistent. In this paper, we propose a new backdoor attack method where the labels of the backdoor images are perfectly aligned with their content, ensuring label consistency. Additionally, the backdoor trigger is meticulously designed, allowing the attack to evade DNN model checks and human inspection. Our approach employs an auto-encoder (AE) to conduct representation learning of benign images and interferes with salient classification features to increase the dependence of backdoor image classification on backdoor triggers. To ensure visual invisibility, we implement a method inspired by image steganography that embeds trigger patterns into the image using the DNN and enable sample-specific backdoor triggers. We conduct comprehensive experiments on multiple benchmark datasets and network architectures to verify the effectiveness of our proposed method under the metric of attack success rate and invisibility. The results also demonstrate satisfactory performance against a variety of defense methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10766-10778"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Point Clouds are Specialized Images: A Knowledge Transfer Approach for 3D Understanding","authors":"Jiachen Kang;Wenjing Jia;Xiangjian He;Kin Man Lam","doi":"10.1109/TMM.2024.3412330","DOIUrl":"10.1109/TMM.2024.3412330","url":null,"abstract":"Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as “specialized images”. This conceptual shift allows PCExpert to leverage knowledge derived from large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with an additional pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the arts in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under \u0000<monospace>LINEAR</monospace>\u0000 fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already closely approximated the results obtained with \u0000<monospace>FULL</monospace>\u0000 model fine-tuning (92.66%), demonstrating its effective representation capability.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10755-10765"},"PeriodicalIF":8.4,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InfoUCL: Learning Informative Representations for Unsupervised Continual Learning","authors":"Liang Zhang;Jiangwei Zhao;Qingbo Wu;Lili Pan;Hongliang Li","doi":"10.1109/TMM.2024.3412389","DOIUrl":"10.1109/TMM.2024.3412389","url":null,"abstract":"Unsupervised continual learning (UCL) has made remarkable progress over the past two years, significantly expanding the application of continual learning (CL). However, existing UCL approaches have only focused on transferring continual strategies from supervised to unsupervised. They have overlooked the relationship issue between visual features and representational continuity. This work draws attention to the texture bias problem in existing UCL methods. To address this problem, we propose a new UCL framework called InfoUCL, in which we develop InfoDrop contrastive loss to guide continual learners to extract more informative shape features of objects and discard useless texture features simultaneously. The proposed InfoDrop contrastive loss is general and can be combined with various UCL methods. Extensive experiments on various benchmarks have demonstrated that our InfoUCL framework can lead to higher classification accuracy and superior robustness to catastrophic forgetting.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10779-10791"},"PeriodicalIF":8.4,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Label-Guided Dynamic Spatial-Temporal Fusion for Video-Based Facial Expression Recognition","authors":"Ziyang Zhang;Xiang Tian;Yuan Zhang;Kailing Guo;Xiangmin Xu","doi":"10.1109/TMM.2024.3407693","DOIUrl":"10.1109/TMM.2024.3407693","url":null,"abstract":"Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features cascadingly has been proposed as an alternative approach However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although frame label is highly related to video label, it is overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF) that adopts frame labels to enhance the discriminative ability of spatial features and guide temporal fusion. By assigning each frame a video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and label distribution of spatial features is utilized to measure the classification confidence of each frame. The confidence values serve as dynamic weights to emphasize crucial frames during temporal fusion of spatial features. Our LG-DSTF achieves state-of-the-art results on FER benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10503-10513"},"PeriodicalIF":8.4,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlation-Guided Distribution and Geometry Alignments for Heterogeneous Domain Adaptation","authors":"Wai Keung Wong;Dewei Lin;Yuwu Lu;Jiajun Wen;Zhihui Lai;Xuelong Li","doi":"10.1109/TMM.2024.3411316","DOIUrl":"10.1109/TMM.2024.3411316","url":null,"abstract":"In this paper, we present a novel approach named correlation-guided distribution and geometry alignments (CDGA) for heterogeneous domain adaptation. Unlike existing methods that typically combine feature alignment and domain alignment into a single objective function, our proposed CDGA separates the two alignments into distinct steps. The two adaptation steps are: paired canonical correlation analysis (PCCA) and distribution and geometry alignments (DGA). In the PCCA step, CDGA focuses on maximizing the within-category correlation between source and target samples to produce the dimension-aligned feature representations for the next adaptation step. In the DGA step, CDGA is responsible for learning a classifier that incorporates both distribution and geometry alignments. Furthermore, during this step, the highly confident pseudo labeled samples are carefully selected for the next iteration of PCCA, establishing a beneficial coupling between PCCA and DGA to improve the adaptation performance in an iterative manner. Experimental results on various visual cross-domain benchmarks demonstrate that CDGA achieves remarkable performance compared to the existing shallow heterogeneous domain adaptation methods and even exhibits superiority over the state-of-the-art neural network-based approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10741-10754"},"PeriodicalIF":8.4,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Alignment-Free RGBT Salient Object Detection: Semantics-Guided Asymmetric Correlation Network and a Unified Benchmark","authors":"Kunpeng Wang;Danying Lin;Chenglong Li;Zhengzheng Tu;Bin Luo","doi":"10.1109/TMM.2024.3410542","DOIUrl":"10.1109/TMM.2024.3410542","url":null,"abstract":"RGB and Thermal (RGBT) Salient Object Detection (SOD) aims to achieve high-quality saliency prediction by exploiting the complementary information of visible and thermal image pairs, which are initially captured in an unaligned manner. However, existing methods are tailored for manually aligned image pairs, which are labor-intensive, and directly applying these methods to original unaligned image pairs could significantly degrade their performance. In this paper, we make the first attempt to address RGBT SOD for initially captured RGB and thermal image pairs without manual alignment. Specifically, we propose a Semantics-guided Asymmetric Correlation Network (SACNet) that consists of two novel components: 1) an asymmetric correlation module utilizing semantics-guided attention to model cross-modal correlations specific to unaligned salient regions; 2) an associated feature sampling module to sample relevant thermal features according to the corresponding RGB features for multi-modal feature integration. In addition, we construct a unified benchmark dataset called UVT2000, containing 2000 RGB and thermal image pairs directly captured from various real-world scenes without any alignment, to facilitate research on alignment-free RGBT SOD. Extensive experiments on both aligned and unaligned datasets demonstrate the effectiveness and superior performance of our method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10692-10707"},"PeriodicalIF":8.4,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141388230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to Evaluate the Artness of AI-Generated Images","authors":"Junyu Chen;Jie An;Hanjia Lyu;Christopher Kanan;Jiebo Luo","doi":"10.1109/TMM.2024.3410672","DOIUrl":"10.1109/TMM.2024.3410672","url":null,"abstract":"Assessing the artness of AI-generated images continues to be a challenge within the realm of image generation. Most existing metrics cannot be used to perform instance-level and reference-free artness evaluation. This paper presents ArtScore, a metric designed to evaluate the degree to which an image resembles authentic artworks by artists (or conversely photographs), thereby offering a novel approach to artness assessment. We first blend pre-trained models for photo and artwork generation, resulting in a series of mixed models. Subsequently, we utilize these mixed models to generate images exhibiting varying degrees of artness with pseudo-annotations. Each photorealistic image has a corresponding artistic counterpart and a series of interpolated images that range from realistic to artistic. This dataset is then employed to train a neural network that learns to estimate quantized artness levels of arbitrary images. Extensive experiments reveal that the artness levels predicted by ArtScore \u0000<bold>align more closely with human artistic evaluation than existing evaluation metrics</b>\u0000, such as Gram loss and ArtFID.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10731-10740"},"PeriodicalIF":8.4,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight Model Pre-Training via Language Guided Knowledge Distillation","authors":"Mingsheng Li;Lin Zhang;Mingzhen Zhu;Zilong Huang;Gang Yu;Jiayuan Fan;Tao Chen","doi":"10.1109/TMM.2024.3410532","DOIUrl":"10.1109/TMM.2024.3410532","url":null,"abstract":"This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10720-10730"},"PeriodicalIF":8.4,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint-Limb Compound Triangulation With Co-Fixing for Stereoscopic Human Pose Estimation","authors":"Zhuo Chen;Xiaoyue Wan;Yiming Bao;Xu Zhao","doi":"10.1109/TMM.2024.3410514","DOIUrl":"10.1109/TMM.2024.3410514","url":null,"abstract":"As a special subset of multi-view settings for 3D human pose estimation, stereoscopic settings show promising applications in practice since they are not ill-posed but could be as mobile as monocular ones. However, when there are only two views, the problems of occlusions and “double counting” (ambiguity between symmetric joints) pose greater challenges that are not addressed by previous approaches. On this concern, we propose a novel framework to detect limb orientations in field form and incorporate them explicitly with joint features. Two modules are proposed to realize the fusion. At 3D level, we design \u0000<italic>compound triangulation</i>\u0000 as an explicit module that produces the optimal pose using 2D joint locations and limb orientations. The module is derived from reformulating triangulation in 3D space, and expanding it with the optimization of limb orientations. At 2D level, we propose a parameter-free module named \u0000<italic>co-fixing</i>\u0000 to enable joint and limb features to fix each other to alleviate the impact of “double counting.” Features from both parts are first used to infer each other via simple convolutions and then fixed by the inferred ones respectively. We test our method on two public benchmarks, Human3.6M and Total Capture, and our method achieves state-of-the-art performance on stereoscopic settings and comparable results on common 4-view benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10708-10719"},"PeriodicalIF":8.4,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VirPNet: A Multimodal Virtual Point Generation Network for 3D Object Detection","authors":"Lin Wang;Shiliang Sun;Jing Zhao","doi":"10.1109/TMM.2024.3410117","DOIUrl":"10.1109/TMM.2024.3410117","url":null,"abstract":"LiDAR and camera are the most common used sensors to percept the road scenes in autonomous driving. Current methods tried to fuse the two complementary information to boost 3D object detection. However, there are still two burning problems for multi-modality 3D object detection. One is the detection problem for the objects with sparse point clouds. The other is the misalignment of different sensors caused by the fixed physical locations. Therefore, this paper argues that explicitly fusing information from the two modalities with the physical misalignment is suboptimal for multi-modality 3D object detection. This paper presents a novel virtual point generation network, VirPNet, to overcome the multi-modality fusion challenges. On one hand, it completes sparse point cloud objects from image source and improves the final detection accuracy. On the other hand, it directly detects 3D targets from raw point clouds to avoid the physical misalignment between LiDAR and camera sensors. Different from previous point cloud completion methods, VirPNet fully utilizes the geometric information of pixels and point clouds and simplifies 3D point cloud regression into a 2D distance regression problem through a virtual plane. Experimental results on KITTI 3D object detection dataset and nuScenes dataset demonstrate that VirPNet improves the detection accuracy with the help of the generated virtual points.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10597-10609"},"PeriodicalIF":8.4,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}