{"title":"TBag: Three Recipes for Building up a Lightweight Hybrid Network for Real-Time SISR","authors":"Ruoyi Xue;Cheng Cheng;Hang Wang;Hongbin Sun","doi":"10.1109/TMM.2025.3542966","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542966","url":null,"abstract":"The prevalent convolution neural network (CNN) and Transformer have revolutionized the area of single-image super-resolution (SISR). Though these models have significantly improved performance, they often struggle with real-time applications or on resource-constrained platforms due to their complexity. In this paper, we propose TBag, a lightweight hybrid network that combines the strengths of CNN and Transformer to address these challenges. Our method simplifies the Transformer block with three key optimizations: 1) No projection layer is applied to the value in the original self-attention operation; 2) The number of tokens is rescaled before the self-attention operation and then rescaled back for easing of computation; 3) The expansion factor of the original feed-forward network (FFN) is adjusted. These optimizations enable the development of an efficient hybrid network tailored for real-time SISR. Notably, the hybrid design of CNN and Transformer further enhances both local detail recovery and global feature modeling. Extensive experiments show that TBag achieves a competitive trade-off between effectiveness and efficiency compared to previous lightweight SISR methods (e.g., <bold>+0.42 dB</b> PSNR with an <bold>86.7%</b> reduction in latency). Moreover, TBag's real-time capabilities make it highly suitable for practical applications, with the TBag-Tiny version achieving up to <bold>59 FPS</b> on hardware devices. Future work will explore the potential of this hybrid approach in other image restoration tasks, such as denoising and deblurring.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5363-5375"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frefusion: Frequency Domain Transformer for Infrared and Visible Image Fusion","authors":"Junjie Shi;Puhong Duan;Xiaoguang Ma;Jianning Chi;Yong Dai","doi":"10.1109/TMM.2025.3543019","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543019","url":null,"abstract":"Visible and infrared image fusion(VIF) provides more comprehensive understanding of a scene and can facilitate subsequent processing. Although frequency domain contains valuable global information in low frequency and rapid pixel intensity variation data in high frequency of images, existing fusion methods mainly focus on spatial domain. To close this gap, a novel VIF method in frequency domain is proposed. First, a frequency-domain feature extraction module is developed for source images. Then, a frequency-domain transformer fusion method is designed to merge the extracted features. Finally, a residual reconstruction module is introduced to obtain final fused images. To the best of our knowledge, it is the first time that image fusion study is conducted from frequency domain perspective. Comprehensive experiments on three datasets, i.e., MSRS, TNO, and Roadscene, demonstrate that the proposed approach obtains superior fusion performance over several state-of-the-art fusion methods, indicating its great potential as a generic backbone for VIF tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5722-5730"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Deep Semantic Segmentation Network With Semantic and Contextual Refinements","authors":"Zhiyan Wang;Deyin Liu;Lin Yuanbo Wu;Song Wang;Xin Guo;Lin Qi","doi":"10.1109/TMM.2025.3543037","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543037","url":null,"abstract":"Semantic segmentation is a fundamental task in multimedia processing, which can be used for analyzing, understanding, editing contents of images and videos, among others. To accelerate the analysis of multimedia data, existing segmentation researches tend to extract semantic information by progressively reducing the spatial resolutions of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets—Cityscapes, Bdd100 K, and ADE20K—demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4856-4868"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preemptive Defense Algorithm Based on Generalizable Black-Box Feedback Regulation Strategy Against Face-Swapping Deepfake Models","authors":"Zhongjie Mi;Xinghao Jiang;Tanfeng Sun;Ke Xu;Qiang Xu","doi":"10.1109/TMM.2025.3543059","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543059","url":null,"abstract":"In the previous efforts to counteract Deepfake, detection methods were most adopted, but they could only function after-effect and could not undo the harm. Preemptive defense has recently gained attention as an alternative, but such defense works have either limited their scenario to facial-reenactment Deepfake models or only targeted specific face-swapping Deepfake model. Motivated to fill this gap, we start by establishing the Deepfake scenario modeling and finding the scenario difference among categories, then move on to the face-swapping scenario setting overlooked by previous works. Based on this scenario, we first propose a novel Black-Box Penetrating Defense Process that enables defense against face-swapping models without prior model knowledge. Then we propose a novel Double-Blind Feedback Regulation Strategy to solve the reality problem of avoiding alarming distortions after defense that had previously been ignored, which helps conduct valid preemptive defense against face-swapping Deepfake models in reality. Experimental results in comparison with state-of-the-art defense methods are conducted against popular face-swapping Deepfake models, proving our proposed method valid under practical circumstances.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4780-4794"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WiViPose: A Video-Aided Wi-Fi Framework for Environment-Independent 3D Human Pose Estimation","authors":"Lei Zhang;Haoran Ning;Jiaxin Tang;Zhenxiang Chen;Yaping Zhong;Yahong Han","doi":"10.1109/TMM.2025.3543090","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543090","url":null,"abstract":"The inherent complexity of Wi-Fi signals makes video-aided Wi-Fi 3D pose estimation difficult. The challenges include the limited generalizability of the task across diverse environments, its significant signal heterogeneity, and its inadequate ability to analyze local and geometric information. To overcome these challenges, we introduce WiViPose, a video-aided Wi-Fi framework for 3D pose estimation, which attains enhanced cross-environment generalization through cross-layer optimization. Bilinear temporal-spectral fusion (BTSF) is initially used to fuse the time-domain and frequency-domain features derived from Wi-Fi. Video features are derived from a multiresolution convolutional pose machine and enhanced by local self-attention. Cross-modality data fusion is facilitated through an attention-based transformer, with the process further refined under a supervisory mechanism. WiViPose demonstrates effectiveness by achieving an average percentage of correct keypoints (PCK)@50 of 91.01% across three typical indoor environments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5225-5240"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deformable Cross-Attention Transformer for Weakly Aligned RGB–T Pedestrian Detection","authors":"Yu Hu;Xiaobo Chen;Sheng Wang;Luyang Liu;Hengyang Shi;Lihong Fan;Jing Tian;Jun Liang","doi":"10.1109/TMM.2025.3543056","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543056","url":null,"abstract":"Pedestrian detection plays a crucial role in autonomous driving systems. To ensure reliable and effective detection in challenging conditions, researchers have proposed RGB–T (RGB–thermal) detectors that integrate thermal images with color images for more complementary feature representations. However, existing methods face challenges in capturing the spatial and geometric correlations between different modalities, as well as in assuming perfect synchronization of the two modalities, which is unrealistic in real-world scenarios. In response to these challenges, we present a new deformable-attention-based approach for weakly aligned RGB–T pedestrian detection. The proposed method uses a dual-branch cross-attention mechanism to capture the inherent spatial and geometric correlations between color and thermal images. Furthermore, it incorporates positional information for each image pixel into the sampling offset generation to enhance robustness in scenarios where modalities are not precisely aligned or registered. To reduce computational complexity, we introduce a local attention mechanism that samples only a small set of keys and values within a limited region in the feature maps for each query. Extensive experiments and ablation studies conducted on multiple public datasets confirm the effectiveness of the proposed framework.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4400-4411"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144750885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bi-Directional Deep Contextual Video Compression","authors":"Xihua Sheng;Li Li;Dong Liu;Shiqi Wang","doi":"10.1109/TMM.2025.3543061","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543061","url":null,"abstract":"Deep video compression has made impressive process in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, their compression performance is still far behind that of traditional bi-directional video codecs. In this article, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme mainly has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model, to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to an effective bit allocation across large groups of pictures (GOP). Experimental results show that our DCVC-B achieves an average reduction of 26.6% in BD-Rate compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration. We anticipate our work can provide valuable insights and bring up deep B-frame coding to the next level.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5632-5646"},"PeriodicalIF":9.7,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rain2Avoid: Learning Deraining by Self-Supervision","authors":"Yan-Tsung Peng;Wei-Hua Li;Zihao Chen","doi":"10.1109/TMM.2025.3542981","DOIUrl":"https://doi.org/10.1109/TMM.2025.3542981","url":null,"abstract":"Images captured on rainy days often contain rain streaks that can obscure important scenery and degrade the performance of high-level vision tasks, such as image segmentation in autonomous vehicles. As a result, image deraining, a low-level vision task focused on removing rain streaks from images, has gained popularity over the past decade. Recent advancements have primarily concentrated on supervised image deraining methods, which rely on paired rain-clean image datasets to train deep neural network models. However, collecting such paired real data is challenging and time-consuming. To address this, our method introduces a novel self-supervised approach that leverages the proposed locally dominant gradient prior and non-local self-similarity stochastic sampling. This approach extracts potential rain streaks and generates stochastic derained references for image deraining. Experimental results on public benchmark image-deraining datasets show that our proposed method performs favorably against state-of-the-art few-shot and self-supervised image deraining methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4765-4779"},"PeriodicalIF":9.7,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Adaptation Network for Multi-Source and Multi-Target Domain Adaptation","authors":"Yuwu Lu;Haoyu Huang;Xue Hu;Zhihui Lai","doi":"10.1109/TMM.2025.3543094","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543094","url":null,"abstract":"Multi-source domain adaptation (MSDA) has garnered significant attention due to its emphasis on transferring knowledge from multiple labeled source domains to a single unlabeled target domain. MSDA requires sufficient labeled data from multiple source domains, but in practice, massive unlabeled data exist instead of well-labeled data. Multiple target domains also provide plenty of information, which is useful for domain adaptation. However, most MSDA studies overlook the critical scenario of multi-source and multi-target domain adaptation (MMDA). To address these problems, we propose a Multiple Adaptation Network (MAN) approach for MMDA, which utilizes multiple alignment strategies for each source-target domain pair-group to align relevant specific feature spaces. MAN also aligns multiple classifiers for the relevant feature spaces to optimize the decision boundaries of multiple target domains. Moreover, to consider the task relations of multiple classifiers, we minimize the semantic differences between the target-conditioned classifiers and utilize a weight learning category to optimize this process. To fully utilize the information from multiple target domains, we transfer the style information of the target data to the source data, aiding in the training of multiple classifiers. Extensive experiments in challenge domain adaptation benchmarks, including the ImageCLEF-DA, Office-Home, DomainNet, and RGB-to-thermal datasets, demonstrate the superiority of our method over the state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5731-5745"},"PeriodicalIF":9.7,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Interactions Between Autonomous Agents in a Multi-Agent Self-Awareness Architecture","authors":"Abrham Shiferaw Alemaw;Giulia Slavic;Pamela Zontone;Lucio Marcenaro;David Martin Gomez;Carlo Regazzoni","doi":"10.1109/TMM.2025.3543110","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543110","url":null,"abstract":"Learning from experience is a fundamental capability of intelligent agents. Autonomous systems rely on sensors that provide data about the environment and internal situations to their perception systems for learning and inference mechanisms. These systems can also learn Self-Aware and Situation-Aware generative modules from these data to localize themselves and interact with the environment. In this paper, we propose a self-aware cognitive architecture capable to perform tasks where the interactions between the self-state of an agent and the surrounding environment are explicitly and dynamically represented. We specifically develop a Deep Learning (DL) based Self-Aware interaction model, empowered by learning from Multi-Modal Perception (MMP) and World Models using multi-sensory data in a novel Multi-Agent Self-Awareness Architecture (MASAA). Two sub-modules are developed, the Situation Model (SM) and the First-Person model (FPM), that address different and interrelated aspects of the World Model (WM). The MMP model, instead, aims at learning the mapping of different sensory perceptions into Exteroceptive (EI) and Proprioceptive (PI) latent information. The WM then uses the learned MMP model as experience to predict dynamic self-behaviors and interaction patterns within the experienced environment. WM and MMP Models are learned in a data-driven way, starting from the lower-dimensional odometry data used to guide the learning of higher-dimensional video data, thus generating coupled Generalized State Hierarchical Dynamic Bayesian Networks (GS-HDBNs). We test our model on KITTI, CARLA, and iCab datasets, achieving high performance and a low average localization error (RMSE) of 2.897%, when considering two interacting agents.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5035-5049"},"PeriodicalIF":9.7,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}