{"title":"Adaptive Multi-Scale Language Reinforcement for Multimodal Named Entity Recognition","authors":"Enping Li;Tianrui Li;Huaishao Luo;Jielei Chu;Lixin Duan;Fengmao Lv","doi":"10.1109/TMM.2025.3543105","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543105","url":null,"abstract":"Over the recent years, multimodal named entity recognition has gained increasing attentions due to its wide applications in social media. The key factor of multimodal named entity recognition is to effectively fuse information of different modalities. Existing works mainly focus on reinforcing textual representations by fusing image features via the cross-modal attention mechanism. However, these works are limited in reinforcing the text modality at the token level. As a named entity usually contains several tokens, modeling token-level inter-modal interactions is suboptimal for the multimodal named entity recognition problem. In this work, we propose a multimodal named entity recognition approach dubbed Adaptive Multi-scale Language Reinforcement (AMLR) to implement entity-level language reinforcement. To this end, our model first expands token-level textual representations into multi-scale textual representations which are composed of language units of different lengths. After that, the visual information reinforces the language modality by modeling the cross-modal attention between images and expanded multi-scale textual representations. Unlike existing token-level language reinforcement methods, the word sequences of named entities can be directly interacted with the visual features as a whole, making the modeled cross-modal correlations more reasonable. Although the underlying entity is not given, the training procedure can encourage the relevant image contents to adaptively attend to the appropriate language units, making our approach not rely on the pipeline design. Comprehensive evaluation results on two public Twitter datasets clearly demonstrate the superiority of our proposed model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5312-5323"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SyNet: A Synergistic Network for 3D Object Detection Through Geometric-Semantic-Based Multi-Interaction Fusion","authors":"Xiaoqin Zhang;Kenan Bi;Sixian Chan;Shijian Lu;Xiaolong Zhou","doi":"10.1109/TMM.2025.3542993","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542993","url":null,"abstract":"Driven by rising demands in autonomous driving, robotics, <italic>etc.</i>, 3D object detection has recently achieved great advancement by fusing optical images and LiDAR point data. On the other hand, most existing optical-LiDAR fusion methods straightly overlay RGB images and point clouds without adequately exploiting the synergy between them, leading to suboptimal fusion and 3D detection performance. Additionally, they often suffer from limited localization accuracy without proper balancing of global and local object information. To address this issue, we design a synergistic network (SyNet) that fuses geometric information, semantic information, as well as global and local information of objects for robust and accurate 3D detection. The SyNet captures synergies between optical images and LiDAR point clouds from three perspectives. The first is geometric, which derives high-quality depth by projecting point clouds onto multi-view images, enriching optical RGB images with 3D spatial information for a more accurate interpretation of image semantics. The second is semantic, which voxelizes point clouds and establishes correspondences between the derived voxels and image pixels, enriching 3D point clouds with semantic information for more accurate 3D detection. The third is balancing local and global object information, which introduces deformable self-attention and cross-attention to process the two types of complementary information in parallel for more accurate object localization. Extensive experiments show that SyNet achieves 70.7% mAP and 73.5% NDS on the nuScenes test set, demonstrating its effectiveness and superiority as compared with the state-of-the-art.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4950-4960"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CMoA: Contrastive Mixture of Adapters for Generalized Few-Shot Continual Learning","authors":"Yawen Cui;Jian Zhao;Zitong Yu;Rizhao Cai;Xun Wang;Lei Jin;Alex C. Kot;Li Liu;Xuelong Li","doi":"10.1109/TMM.2025.3543038","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543038","url":null,"abstract":"The goal of Few-Shot Continual Learning (FSCL) is to incrementally learn novel tasks with limited labeled samples and preserve previous capabilities simultaneously. However, current FSCL works lack research on domain increment and domain generalization ability, which cannot cope with changes in the visual perception environment. In this paper, we set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental scenarios together with domain generalization assessment. Firstly, two benchmark datasets and protocols are newly arranged, and detailed baselines are provided for this unexplored configuration. Furthermore, we find that common continual learning methods have poor generalization ability on unseen domains and cannot better tackle catastrophic forgetting issue in cross-incremental tasks. Hence, we propose a rehearsal-free framework based on Vision Transformer (ViT) named Contrastive Mixture of Adapters (CMoA). It contains two non-conflicting parts: (1) By applying the fast-adaptation characteristic of adapter-embedded ViT, the mixture of Adapters (MoA) module is incorporated into ViT. For stability purpose, cosine similarity regularization and dynamic weighting are designed to make each adapter learn specific knowledge and concentrate on particular classes. (2) To further enhance domain generalization ability, we alleviate the intra-class variation by prototype-calibrated contrastive learning to improve domain-invariant representation learning. Finally, six evaluation indicators showing the overall performance and forgetting are compared by comprehensive experiments on two benchmark datasets to validate the efficacy of CMoA, and the results illustrate that CMoA can achieve comparative performance with rehearsal-based continual learning methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5533-5547"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TBag: Three Recipes for Building up a Lightweight Hybrid Network for Real-Time SISR","authors":"Ruoyi Xue;Cheng Cheng;Hang Wang;Hongbin Sun","doi":"10.1109/TMM.2025.3542966","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542966","url":null,"abstract":"The prevalent convolution neural network (CNN) and Transformer have revolutionized the area of single-image super-resolution (SISR). Though these models have significantly improved performance, they often struggle with real-time applications or on resource-constrained platforms due to their complexity. In this paper, we propose TBag, a lightweight hybrid network that combines the strengths of CNN and Transformer to address these challenges. Our method simplifies the Transformer block with three key optimizations: 1) No projection layer is applied to the value in the original self-attention operation; 2) The number of tokens is rescaled before the self-attention operation and then rescaled back for easing of computation; 3) The expansion factor of the original feed-forward network (FFN) is adjusted. These optimizations enable the development of an efficient hybrid network tailored for real-time SISR. Notably, the hybrid design of CNN and Transformer further enhances both local detail recovery and global feature modeling. Extensive experiments show that TBag achieves a competitive trade-off between effectiveness and efficiency compared to previous lightweight SISR methods (e.g., <bold>+0.42 dB</b> PSNR with an <bold>86.7%</b> reduction in latency). Moreover, TBag's real-time capabilities make it highly suitable for practical applications, with the TBag-Tiny version achieving up to <bold>59 FPS</b> on hardware devices. Future work will explore the potential of this hybrid approach in other image restoration tasks, such as denoising and deblurring.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5363-5375"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frefusion: Frequency Domain Transformer for Infrared and Visible Image Fusion","authors":"Junjie Shi;Puhong Duan;Xiaoguang Ma;Jianning Chi;Yong Dai","doi":"10.1109/TMM.2025.3543019","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543019","url":null,"abstract":"Visible and infrared image fusion(VIF) provides more comprehensive understanding of a scene and can facilitate subsequent processing. Although frequency domain contains valuable global information in low frequency and rapid pixel intensity variation data in high frequency of images, existing fusion methods mainly focus on spatial domain. To close this gap, a novel VIF method in frequency domain is proposed. First, a frequency-domain feature extraction module is developed for source images. Then, a frequency-domain transformer fusion method is designed to merge the extracted features. Finally, a residual reconstruction module is introduced to obtain final fused images. To the best of our knowledge, it is the first time that image fusion study is conducted from frequency domain perspective. Comprehensive experiments on three datasets, i.e., MSRS, TNO, and Roadscene, demonstrate that the proposed approach obtains superior fusion performance over several state-of-the-art fusion methods, indicating its great potential as a generic backbone for VIF tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5722-5730"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Deep Semantic Segmentation Network With Semantic and Contextual Refinements","authors":"Zhiyan Wang;Deyin Liu;Lin Yuanbo Wu;Song Wang;Xin Guo;Lin Qi","doi":"10.1109/TMM.2025.3543037","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543037","url":null,"abstract":"Semantic segmentation is a fundamental task in multimedia processing, which can be used for analyzing, understanding, editing contents of images and videos, among others. To accelerate the analysis of multimedia data, existing segmentation researches tend to extract semantic information by progressively reducing the spatial resolutions of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets—Cityscapes, Bdd100 K, and ADE20K—demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4856-4868"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preemptive Defense Algorithm Based on Generalizable Black-Box Feedback Regulation Strategy Against Face-Swapping Deepfake Models","authors":"Zhongjie Mi;Xinghao Jiang;Tanfeng Sun;Ke Xu;Qiang Xu","doi":"10.1109/TMM.2025.3543059","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543059","url":null,"abstract":"In the previous efforts to counteract Deepfake, detection methods were most adopted, but they could only function after-effect and could not undo the harm. Preemptive defense has recently gained attention as an alternative, but such defense works have either limited their scenario to facial-reenactment Deepfake models or only targeted specific face-swapping Deepfake model. Motivated to fill this gap, we start by establishing the Deepfake scenario modeling and finding the scenario difference among categories, then move on to the face-swapping scenario setting overlooked by previous works. Based on this scenario, we first propose a novel Black-Box Penetrating Defense Process that enables defense against face-swapping models without prior model knowledge. Then we propose a novel Double-Blind Feedback Regulation Strategy to solve the reality problem of avoiding alarming distortions after defense that had previously been ignored, which helps conduct valid preemptive defense against face-swapping Deepfake models in reality. Experimental results in comparison with state-of-the-art defense methods are conducted against popular face-swapping Deepfake models, proving our proposed method valid under practical circumstances.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4780-4794"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WiViPose: A Video-Aided Wi-Fi Framework for Environment-Independent 3D Human Pose Estimation","authors":"Lei Zhang;Haoran Ning;Jiaxin Tang;Zhenxiang Chen;Yaping Zhong;Yahong Han","doi":"10.1109/TMM.2025.3543090","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543090","url":null,"abstract":"The inherent complexity of Wi-Fi signals makes video-aided Wi-Fi 3D pose estimation difficult. The challenges include the limited generalizability of the task across diverse environments, its significant signal heterogeneity, and its inadequate ability to analyze local and geometric information. To overcome these challenges, we introduce WiViPose, a video-aided Wi-Fi framework for 3D pose estimation, which attains enhanced cross-environment generalization through cross-layer optimization. Bilinear temporal-spectral fusion (BTSF) is initially used to fuse the time-domain and frequency-domain features derived from Wi-Fi. Video features are derived from a multiresolution convolutional pose machine and enhanced by local self-attention. Cross-modality data fusion is facilitated through an attention-based transformer, with the process further refined under a supervisory mechanism. WiViPose demonstrates effectiveness by achieving an average percentage of correct keypoints (PCK)@50 of 91.01% across three typical indoor environments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5225-5240"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deformable Cross-Attention Transformer for Weakly Aligned RGB–T Pedestrian Detection","authors":"Yu Hu;Xiaobo Chen;Sheng Wang;Luyang Liu;Hengyang Shi;Lihong Fan;Jing Tian;Jun Liang","doi":"10.1109/TMM.2025.3543056","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543056","url":null,"abstract":"Pedestrian detection plays a crucial role in autonomous driving systems. To ensure reliable and effective detection in challenging conditions, researchers have proposed RGB–T (RGB–thermal) detectors that integrate thermal images with color images for more complementary feature representations. However, existing methods face challenges in capturing the spatial and geometric correlations between different modalities, as well as in assuming perfect synchronization of the two modalities, which is unrealistic in real-world scenarios. In response to these challenges, we present a new deformable-attention-based approach for weakly aligned RGB–T pedestrian detection. The proposed method uses a dual-branch cross-attention mechanism to capture the inherent spatial and geometric correlations between color and thermal images. Furthermore, it incorporates positional information for each image pixel into the sampling offset generation to enhance robustness in scenarios where modalities are not precisely aligned or registered. To reduce computational complexity, we introduce a local attention mechanism that samples only a small set of keys and values within a limited region in the feature maps for each query. Extensive experiments and ablation studies conducted on multiple public datasets confirm the effectiveness of the proposed framework.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4400-4411"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bi-Directional Deep Contextual Video Compression","authors":"Xihua Sheng;Li Li;Dong Liu;Shiqi Wang","doi":"10.1109/TMM.2025.3543061","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543061","url":null,"abstract":"Deep video compression has made impressive process in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, their compression performance is still far behind that of traditional bi-directional video codecs. In this article, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme mainly has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model, to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to an effective bit allocation across large groups of pictures (GOP). Experimental results show that our DCVC-B achieves an average reduction of 26.6% in BD-Rate compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration. We anticipate our work can provide valuable insights and bring up deep B-frame coding to the next level.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5632-5646"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}