{"title":"Prototype Alignment With Dedicated Experts for Test-Agnostic Long-Tailed Recognition","authors":"Chen Guo;Weiling Chen;Aiping Huang;Tiesong Zhao","doi":"10.1109/TMM.2024.3521665","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521665","url":null,"abstract":"Unlike vanilla long-tailed recognition trains on imbalanced data but assumes a uniform test class distribution, test-agnostic long-tailed recognition aims to handle arbitrary test class distributions. Existing methods require prior knowledge of test sets for post-adjustment through multi-stage training, resulting in static decisions at the dataset-level. This pipeline overlooks instance diversity and is impractical in real situations. In this work, we introduce Prototype Alignment with Dedicated Experts (PADE), a one-stage framework for test-agnostic long-tailed recognition. PADE tackles unknown test distributions at the instance-level, without depending on test priors. It reformulates the task as a domain detection problem, dynamically adjusting the model for each instance. PADE comprises three main strategies: 1) parameter customization strategy for multi-experts skilled at different categories; 2) normalized target knowledge distillation for mutual guidance among experts while maintaining diversity; 3) re-balanced compactness learning with momentum prototypes, promoting instance alignment with the corresponding class centroid. We evaluate PADE on various long-tailed recognition benchmarks with diverse test distributions. The results verify its effectiveness in both vanilla and test-agnostic long-tailed recognition.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"455-465"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Content-Aware Tunable Selective Encryption for HEVC Using Sine-Modular Chaotification Model","authors":"Qingxin Sheng;Chong Fu;Zhaonan Lin;Junxin Chen;Xingwei Wang;Chiu-Wing Sham","doi":"10.1109/TMM.2024.3521724","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521724","url":null,"abstract":"Existing High Efficiency Video Coding (HEVC) selective encryption algorithms only consider the encoding characteristics of syntax elements to keep format compliance, but ignore the semantic features of video content, which may lead to unnecessary computational and bit rate costs. To tackle this problem, we present a content-aware tunable selective encryption (CATSE) scheme for HEVC. First, a deep hashing network is adopted to retrieve groups of pictures (GOPs) containing sensitive objects. Then, the retrieved sensitive GOPs and the remaining insensitive ones are encrypted with different encryption strengths. For the former, multiple syntax elements are encrypted to ensure security, whereas for the latter, only a few bypass-coded syntax elements are encrypted to improve the encryption efficiency and reduce the bit rate overhead. The keystream sequence used is extracted from the time series of a new improved logistic map with complex dynamic behavior, which is generated by our proposed sine-modular chaotification model. Finally, a reversible steganography is applied to embed the flag bits of the GOP type into the encrypted bitstream, so that the decoder can distinguish the encrypted syntax elements that need to be decrypted in different GOPs. Experimental results indicate that the proposed HEVC CATSE scheme not only provides high encryption speed and low bit rate overhead, but also has superior encryption strength than other state-of-the-art HEVC selective encryption algorithms.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"41-55"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPDFusion:A Semantic Prior Knowledge-Driven Method for Infrared and Visible Image Fusion","authors":"Quanquan Xiao;Haiyan Jin;Haonan Su;Yuanlin Zhang;Zhaolin Xiao;Bin Wang","doi":"10.1109/TMM.2024.3521848","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521848","url":null,"abstract":"Infrared and visible image fusion is currently an important research direction in the field of multimodal image fusion, which aims to utilize the complementary information between infrared images and visible images to generate a new image containing richer information. In recent years, many deep learning-based methods for infrared and visible image fusion have emerged.However, most of these approaches ignore the importance of semantic information in image fusion, resulting in the generation of fused images that do not perform well enough in human visual perception and advanced visual tasks.To address this problem, we propose a semantic prior knowledge-driven infrared and visible image fusion method. The method utilizes a pre-trained semantic segmentation model to acquire semantic information of infrared and visible images, and drives the fusion process of infrared and visible images through semantic feature perception module and semantic feature embedding module.Meanwhile, we divide the fused image into each category block and consider them as components, and utilize the regional semantic adversarial loss to enhance the adversarial network generation ability in different regions, thus improving the quality of the fused image.Through extensive experiments on widely used datasets, the results show that our approach outperforms current leading algorithms in both human eye visualization and advanced visual tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1691-1705"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Make Graph-Based Referring Expression Comprehension Great Again Through Expression-Guided Dynamic Gating and Regression","authors":"Jingcheng Ke;Dele Wang;Jun-Cheng Chen;I-Hong Jhuo;Chia-Wen Lin;Yen-Yu Lin","doi":"10.1109/TMM.2024.3521844","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521844","url":null,"abstract":"One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30 K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performances of various graph-based REC methods. Without any pretaining, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1950-1961"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143801060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HNR-ISC: Hybrid Neural Representation for Image Set Compression","authors":"Pingping Zhang;Shiqi Wang;Meng Wang;Peilin Chen;Wenhui Wu;Xu Wang;Sam Kwong","doi":"10.1109/TMM.2024.3521715","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521715","url":null,"abstract":"Image set compression (ISC) refers to compressing the sets of semantically similar images. Traditional ISC methods typically aim to eliminate redundancy among images at either signal or frequency domain, but often struggle to handle complex geometric deformations across different images effectively. Here, we propose a new Hybrid Neural Representation for ISC (HNR-ISC), including an implicit neural representation for Semantically Common content Compression (SCC) and an explicit neural representation for Semantically Unique content Compression (SUC). Specifically, SCC enables the conversion of semantically common contents into a small-and-sweet neural representation, along with embeddings that can be conveyed as a bitstream. SUC is composed of invertible modules for removing intra-image redundancies. The feature level combination from SCC and SUC naturally forms the final image set. Experimental results demonstrate the robustness and generalization capability of HNR-ISC in terms of signal and perceptual quality for reconstruction and accuracy for the downstream analysis task.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"28-40"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GCCNet: A Novel Network Leveraging Gated Cross-Correlation for Multi-View Classification","authors":"Yuanpeng Zeng;Ru Zhang;Hao Zhang;Shaojie Qiao;Faliang Huang;Qing Tian;Yuzhong Peng","doi":"10.1109/TMM.2024.3521733","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521733","url":null,"abstract":"Multi-view learning is a machine learning paradigm that utilizes multiple feature sets or data sources to improve learning performance and generalization. However, existing multi-view learning methods often do not capture and utilize information from different views very well, especially when the relationships between views are complex and of varying quality. In this paper, we propose a novel multi-view learning framework for the multi-view classification task, called Gated Cross-Correlation Network (GCCNet), which addresses these challenges by integrating the three key operational levels in multi-view learning: representation, fusion, and decision. Specifically, GCCNet contains a novel component called the Multi-View Gated Information Distributor (MVGID) to enhance noise filtering and optimize the retention of critical information. In addition, GCCNet uses cross-correlation analysis to reveal dependencies and interactions between different views, as well as integrates an adaptive weighted joint decision strategy to mitigate the interference of low-quality views. Thus, GCCNet can not only comprehensively capture and utilize information from different views, but also facilitate information exchange and synergy between views, ultimately improving the overall performance of the model. Extensive experimental results on ten benchmark datasets show GCCNet's outperforms state-of-the-art methods on eight out of ten datasets, validating its effectiveness and superiority in multi-view learning.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1086-1099"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Knowledge Distillation With Attention-Based Multi-Modal Fusion for Robust Dim Object Detection","authors":"Zhen Lan;Zixing Li;Chao Yan;Xiaojia Xiang;Dengqing Tang;Han Zhou;Jun Lai","doi":"10.1109/TMM.2024.3521793","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521793","url":null,"abstract":"Automated object detection in aerial images is crucial in both civil and military applications. Existing computer vision-based object detection methods are not robust enough to precisely detect dim objects in aerial images due to the cluttered backgrounds, various observing angles, small object scales, and severe occlusions. Recently, electroencephalography (EEG)-based object detection methods have received increasing attention owing to the advanced cognitive capabilities of human vision. However, how to combine the human intelligence with computer intelligence to achieve robust dim object detection is still an open question. In this paper, we propose a novel approach to efficiently fuse and exploit the properties of multi-modal data for dim object detection. Specifically, we first design a brain-computer interface (BCI) paradigm called eye-tracking-based slow serial visual presentation (ESSVP) to simultaneously collect the paired EEG and image data when subjects search for the dim objects in aerial images. Then, we develop an attention-based multi-modal fusion network to selectively aggregate the learned features of EEG and image modalities. Furthermore, we propose an adaptive multi-teacher knowledge distillation method to efficiently train the multi-modal dim object detector for better performance. To evaluate the effectiveness of our method, we conduct extensive experiments on the collected dataset in subject-dependent and subject-independent tasks. The experimental results demonstrate that the proposed dim object detection method exhibits superior effectiveness and robustness compared to the baselines and the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2083-2096"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge-Guided Cross-Modal Alignment and Progressive Fusion for Chest X-Ray Report Generation","authors":"Lili Huang;Yiming Cao;Pengcheng Jia;Chenglong Li;Jin Tang;Chuanfu Li","doi":"10.1109/TMM.2024.3521728","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521728","url":null,"abstract":"The task of chest X-ray report generation, which aims to simulate the diagnosis process of doctors, has received widespread attention. Compared with the image caption task, chest X-ray report generation is more challenging since it needs to generate a longer and more accurate description of each diagnostic part in chest X-ray images. Most of existing works focus on how to extract better visual features or more accurate text expression based on existing reports. However, they ignore the interactions between visual and text modalities and are thus obviously not in line with human thinking. A small part of works explore the interactions of visual and text modalities, but data-driven learning of cross-modal information mapping can not break the semantic gap between different modalities. In this work, we propose a novel approach called Knowledge-guided Cross-modal Alignment and Progressive fusion (KCAP), which takes the knowledge words from a created medical knowledge dictionary as the bridge to guide the cross-modal feature alignment and fusion, for accurate chest X-ray report generation. In particular, we create the medical knowledge dictionary by extracting medical phrases from the training set and then selecting some phrases with substantive meanings as knowledge words based on their frequency of occurrence. Based on the knowledge words from the medical knowledge dictionary, the visual and text modalities are interacted by a mapping layer for the enhancement of the features of two modalities, and then the alignment fusion module is introduced to mitigate the semantic gap between visual and text modalities. To retain the important details of the original information, we design a progressive fusion scheme to integrate the advantages of both salient fused and original features to generate better medical reports. The experimental results on IU-Xray and MIMIC datasets demonstrate the effectiveness of the proposed KCAP.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"557-567"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SQL-Net: Semantic Query Learning for Point-Supervised Temporal Action Localization","authors":"Yu Wang;Shengjie Zhao;Shiwei Chen","doi":"10.1109/TMM.2024.3521799","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521799","url":null,"abstract":"Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions, meaning that snippets with similar representations should be considered as the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we expect to search for the optimal action sequence that agrees with their semantics. Besides, we leverage some reliable proposals as pseudo labels and design a refinement and completeness module to refine temporal boundaries further, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on THUMOS14 and ActivityNet13 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"84-94"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering","authors":"Dizhan Xue;Shengsheng Qian;Quan Fang;Changsheng Xu","doi":"10.1109/TMM.2024.3521709","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521709","url":null,"abstract":"Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task consisting of answering the visual question and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA) task that only aims at predicting answers for visual questions, EVQA also aims to generate user-friendly explanations to improve the explainability and credibility of reasoning models. To date, existing methods for VQA and EVQA ignore the prompt in the question and enforce the model to predict the probabilities of all answers. Moreover, existing EVQA methods ignore the complex relationships among question words, visual regions, and explanation tokens. Therefore, in this work, we propose a Logic Integrated Neural Inference Network (LININ) to restrict the range of candidate answers based on first-order-logic (FOL) and capture cross-modal relationships to generate rational explanations. Firstly, we design a FOL-based question analysis program to fetch a small number of candidate answers. Secondly, we utilize a multimodal transformer encoder to extract visual and question features, and conduct the prediction on candidate answers. Finally, we design a multimodal explanation transformer to construct cross-modal relationships and generate rational explanations. Comprehensive experiments on benchmark datasets demonstrate the superiority of LININ compared with the state-of-the-art methods for EVQA.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"16-27"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}