{"title":"Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval","authors":"Xinru Guo;Huaxiang Zhang;Li Liu;Dongmei Liu;Xu Lu;Hui Meng","doi":"10.1109/TMM.2024.3521697","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521697","url":null,"abstract":"Deep hashing algorithms have demonstrated considerable success in recent years, particularly in cross-modal retrieval tasks. Although hash-based cross-modal retrieval methods have demonstrated considerable efficacy, the vulnerability of deep networks to adversarial examples represents a significant challenge for the hash retrieval. In the absence of target semantics, previous non-targeted attack methods attempt to attack depth models by adding disturbance to the input data, yielding some positive outcomes. Nevertheless, they still lack specific instance-level hash codes and fail to consider the diversity and semantic association of different modalities, which is insufficient to meet the attacker's expectations. In response, we present a novel Primary code Guided Targeted Attack (PGTA) against cross-modal hashing retrieval. Specifically, we integrate cross-modal instances and labels to obtain well-fused target semantics, thereby enhancing cross-modal interaction. Secondly, the primary code is designed to generate discriminable information with fine-grained semantics for target labels. Benign samples and target semantics collectively generate adversarial examples under the guidance of primary codes, thereby enhancing the efficacy of targeted attacks. Extensive experiments demonstrate that our PGTA outperforms the most advanced methods on three datasets, achieving State-of-the-Art targeted attack performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"312-326"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PointAttention: Rethinking Feature Representation and Propagation in Point Cloud","authors":"Shichao Zhang;Yibo Ding;Tianxiang Huo;Shukai Duan;Lidan Wang","doi":"10.1109/TMM.2024.3521745","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521745","url":null,"abstract":"Self-attention mechanisms have revolutionized natural language processing and computer vision. However, in point cloud analysis, most existing methods focus on point convolution operators for feature extraction, but fail to model long-range and hierarchical dependencies. To overcome above issues, in this paper, we present PointAttention, a novel network for point cloud feature representation and propagation. Specifically, this architecture uses a two-stage Learnable Self-attention for long-range attention weights learning, which is more effective than conventional triple attention. Furthermore, it employs a Hierarchical Learnable Attention Mechanism to formulate momentous global prior representation and perform fine-grained context understanding, which enables our framework to break through the limitation of the receptive field and reduce the loss of contexts. Interestingly, we show that the proposed Learnable Self-attention is equivalent to the coupling of two Softmax attention operations while having lower complexity. Extensive experiments demonstrate that our network achieves highly competitive performance on several challenging publicly available benchmarks, including point cloud classification on ScanObjectNN and ModelNet40, and part segmentation on ShapeNet-Part.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"327-339"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Pitfall: Exploring the Effectiveness of Adaptation in Skeleton-Based Action Recognition","authors":"Qiguang Miao;Wentian Xin;Ruyi Liu;Yi Liu;Mengyao Wu;Cheng Shi;Chi-Man Pun","doi":"10.1109/TMM.2024.3521774","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521774","url":null,"abstract":"Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition by exploiting the adjacency topology of body representation. However, the adaptive strategy adopted by the previous methods to construct the adjacency matrix is not balanced between the performance and the computational cost. We assume this concept of <italic>Adaptive Trap</i>, which can be replaced by multiple autonomous submodules, thereby simultaneously enhancing the dynamic joint representation and effectively reducing network resources. To effectuate the substitution of the adaptive model, we unveil two distinct strategies, both yielding comparable effects. (1) Optimization. <italic>Individuality and Commonality GCNs (IC-GCNs)</i> is proposed to specifically optimize the construction method of the associativity adjacency matrix for adaptive processing. The uniqueness and co-occurrence between different joint points and frames in the skeleton topology are effectively captured through methodologies like preferential fusion of physical information, extreme compression of multi-dimensional channels, and simplification of self-attention mechanism. (2) Replacement. <italic>Auto-Learning GCNs (AL-GCNs)</i> is proposed to boldly remove popular adaptive modules and cleverly utilize human key points as motion compensation to provide dynamic correlation support. AL-GCNs construct a fully learnable group adjacency matrix in both spatial and temporal dimensions, resulting in an elegant and efficient GCN-based model. In addition, three effective tricks for skeleton-based action recognition (Skip-Block, Bayesian Weight Selection Algorithm, and Simplified Dimensional Attention) are exposed and analyzed in this paper. Finally, we employ the variable channel and grouping method to explore the hardware resource bound of the two proposed models. IC-GCN and AL-GCN exhibit impressive performance across NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA, and UAV-Human datasets, with an exceptional parameter-cost ratio.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"56-71"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scene Text Image Super-Resolution Via Semantic Distillation and Text Perceptual Loss","authors":"Cairong Zhao;Rui Shu;Shuyang Feng;Liang Zhu;Xuekuan Wang","doi":"10.1109/TMM.2024.3521759","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521759","url":null,"abstract":"Text Super-Resolution (SR) technology aims to recover lost information in low-resolution text images. With the proposal of TextZoom, which is the first dataset aiming at text super-resolution in real scenes, more and more scene text super-resolution models have been presented on the basis of it. Although these methods have achieved excellent performance, they do not consider how to make full and efficient use of semantic information. Out of this consideration, a Semantic-aware Trident Network (STNet) for Scene Text Image Super-Resolution is proposed. Specifically, pre-trained text recognition model ASTER (Attentional Scene Text Recognizer) is utilized to assist this process in two ways. Firstly, a novel basic block named Semantic-aware Trident Block (STB) is designed to build the STNet, which incorporates an added branch for semantic distillation to learn semantic information of pre-trained recognition model. Secondly, we expand our model in an adversarial training manner and propose new text perceptual loss based on ASTER to further enhance semantic information in SR images. Extensive experiments on TextZoom dataset show that compared with directly recognizing bicubic images, the proposed STNet boosts the recognition accuracy of ASTER, MORAN (Multi-Object Rectified Attention Network), and CRNN (Convolutional Recurrent Neural Network) by 17.4%, 18.2%, and 24.3%, respectively, which is higher than the performance of several existing state-of-the-art (SOTA) SR network models. Besides, experiments in real scenes (on ICDAR 2015 dataset) and in restricted scenarios (defense against adversarial attacks) validate that addition of semantic information enables the proposed method to achieve promising cross-dataset performance. Since the proposed method is trained on cropped images, when applied to real-world scenarios, locations of text in natural images are firstly localized through scene text detection methods, and then cropped text images are obtained based on detected text positions.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1153-1164"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STNet: Deep Audio–Visual Fusion Network for Robust Speaker Tracking","authors":"Yidi Li;Hong Liu;Bing Yang","doi":"10.1109/TMM.2024.3521737","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521737","url":null,"abstract":"Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals hasn't been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features is further interacted in the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1835-1847"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combating Noisy Labels by Alleviating the Memorization of DNNs to Noisy Labels","authors":"Shunjie Yuan;Xinghua Li;Yinbin Miao;Haiyan Zhang;Ximeng Liu;Robert H. Deng","doi":"10.1109/TMM.2024.3521722","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521722","url":null,"abstract":"Data is the essential fuel for deep neural networks (DNNs), and its quality affects the practical performance of DNNs. In real-world training scenarios, the successful generalization performance of DNNs is severely challenged by noisy samples with incorrect labels. To combat noisy samples in image classification, numerous methods based on sample selection and semi-supervised learning (SSL) have been developed, where sample selection is used to provide the supervision signal for SSL, achieving great success in resisting noisy samples. Due to the necessary warm-up training on noisy datasets and the basic sample selection mechanism, DNNs are still confronted with the challenge of memorizing noisy samples. However, existing methods do not address the memorization of noisy samples by DNNs explicitly, which hinders the generalization performance of DNNs. To alleviate this issue, we present a new approach to combat noisy samples. First, we propose a memorized noise detection method to detect noisy samples that DNNs have already memorized during the training process. Next, we design a noise-excluded sample selection method and a noise-alleviated MixMatch to alleviate the memorization of DNNs to noisy samples. Finally, we integrate our approach with the established method DivideMix, proposing Modified-DivideMix. The experimental results on CIFAR-10, CIFAR-100, and Clothing1M demonstrate the effectiveness of our approach.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"597-609"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prototype Alignment With Dedicated Experts for Test-Agnostic Long-Tailed Recognition","authors":"Chen Guo;Weiling Chen;Aiping Huang;Tiesong Zhao","doi":"10.1109/TMM.2024.3521665","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521665","url":null,"abstract":"Unlike vanilla long-tailed recognition trains on imbalanced data but assumes a uniform test class distribution, test-agnostic long-tailed recognition aims to handle arbitrary test class distributions. Existing methods require prior knowledge of test sets for post-adjustment through multi-stage training, resulting in static decisions at the dataset-level. This pipeline overlooks instance diversity and is impractical in real situations. In this work, we introduce Prototype Alignment with Dedicated Experts (PADE), a one-stage framework for test-agnostic long-tailed recognition. PADE tackles unknown test distributions at the instance-level, without depending on test priors. It reformulates the task as a domain detection problem, dynamically adjusting the model for each instance. PADE comprises three main strategies: 1) parameter customization strategy for multi-experts skilled at different categories; 2) normalized target knowledge distillation for mutual guidance among experts while maintaining diversity; 3) re-balanced compactness learning with momentum prototypes, promoting instance alignment with the corresponding class centroid. We evaluate PADE on various long-tailed recognition benchmarks with diverse test distributions. The results verify its effectiveness in both vanilla and test-agnostic long-tailed recognition.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"455-465"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Content-Aware Tunable Selective Encryption for HEVC Using Sine-Modular Chaotification Model","authors":"Qingxin Sheng;Chong Fu;Zhaonan Lin;Junxin Chen;Xingwei Wang;Chiu-Wing Sham","doi":"10.1109/TMM.2024.3521724","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521724","url":null,"abstract":"Existing High Efficiency Video Coding (HEVC) selective encryption algorithms only consider the encoding characteristics of syntax elements to keep format compliance, but ignore the semantic features of video content, which may lead to unnecessary computational and bit rate costs. To tackle this problem, we present a content-aware tunable selective encryption (CATSE) scheme for HEVC. First, a deep hashing network is adopted to retrieve groups of pictures (GOPs) containing sensitive objects. Then, the retrieved sensitive GOPs and the remaining insensitive ones are encrypted with different encryption strengths. For the former, multiple syntax elements are encrypted to ensure security, whereas for the latter, only a few bypass-coded syntax elements are encrypted to improve the encryption efficiency and reduce the bit rate overhead. The keystream sequence used is extracted from the time series of a new improved logistic map with complex dynamic behavior, which is generated by our proposed sine-modular chaotification model. Finally, a reversible steganography is applied to embed the flag bits of the GOP type into the encrypted bitstream, so that the decoder can distinguish the encrypted syntax elements that need to be decrypted in different GOPs. Experimental results indicate that the proposed HEVC CATSE scheme not only provides high encryption speed and low bit rate overhead, but also has superior encryption strength than other state-of-the-art HEVC selective encryption algorithms.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"41-55"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPDFusion:A Semantic Prior Knowledge-Driven Method for Infrared and Visible Image Fusion","authors":"Quanquan Xiao;Haiyan Jin;Haonan Su;Yuanlin Zhang;Zhaolin Xiao;Bin Wang","doi":"10.1109/TMM.2024.3521848","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521848","url":null,"abstract":"Infrared and visible image fusion is currently an important research direction in the field of multimodal image fusion, which aims to utilize the complementary information between infrared images and visible images to generate a new image containing richer information. In recent years, many deep learning-based methods for infrared and visible image fusion have emerged.However, most of these approaches ignore the importance of semantic information in image fusion, resulting in the generation of fused images that do not perform well enough in human visual perception and advanced visual tasks.To address this problem, we propose a semantic prior knowledge-driven infrared and visible image fusion method. The method utilizes a pre-trained semantic segmentation model to acquire semantic information of infrared and visible images, and drives the fusion process of infrared and visible images through semantic feature perception module and semantic feature embedding module.Meanwhile, we divide the fused image into each category block and consider them as components, and utilize the regional semantic adversarial loss to enhance the adversarial network generation ability in different regions, thus improving the quality of the fused image.Through extensive experiments on widely used datasets, the results show that our approach outperforms current leading algorithms in both human eye visualization and advanced visual tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1691-1705"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression","authors":"Jingcheng Ke;Dele Wang;Jun-Cheng Chen;I-Hong Jhuo;Chia-Wen Lin;Yen-Yu Lin","doi":"10.1109/TMM.2024.3521844","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521844","url":null,"abstract":"One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30 K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performances of various graph-based REC methods. Without any pretaining, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1950-1961"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143801060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}