{"title":"Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter","authors":"Weida Chen;Jie Jiang;Linfei Wang;Huafeng Li;Yibing Zhan;Dapeng Tao","doi":"10.1109/TMM.2024.3521756","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521756","url":null,"abstract":"Recently, weakly supervised methods for scene text spotter are increasingly popular with researchers due to their potential to significantly reduce dataset annotation efforts. The latest progress in this field is text spotter based on single or multi-point annotations. However, this method struggles with the sensitivity of text recognition to the precise annotation location and fails to capture the relative positions and shapes of characters, leading to impaired recognition of texts with extensive rotations and flips. To address these challenges, this paper develops a novel method named Coarse-point-supervised Scene Text Spotter (Cps-STS). Cps-STS first utilizes a few approximate points as text location labels and introduces a learnable position modulation mechanism, easing the accuracy requirements for annotations and enhancing model robustness. Additionally, we incorporate a Spatial Compatibility Attention (SCA) module for text decoding to effectively utilize spatial data such as position and shape. This module fuses compound queries and global feature maps, serving as a bias in the SCA module to express text spatial morphology. In order to accurately locate and decode text content, we introduce features containing spatial morphology information and text content into the input features of the text decoder. By introducing features with spatial morphology information as bias terms into the text decoder, ablation experiments demonstrate that this operation enables the model to effectively identify and utilize the relationship between text content and position to enhance the recognition performance of our model. One significant advantage of Cps-STS is its ability to achieve full supervision-level performance with just a few imprecise coarse points at a low cost. Extensive experiments validate the effectiveness and superiority of Cps-STS over existing approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1652-1664"},"PeriodicalIF":8.4,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSFam: Scribble Supervised Salient Object Detection Family","authors":"Zhengyi Liu;Sheng Deng;Xinrui Wang;Linbo Wang;Xianyong Fang;Bin Tang","doi":"10.1109/TMM.2025.3543092","DOIUrl":"https://doi.org/10.1109/TMM.2025.3543092","url":null,"abstract":"Scribble supervised salient object detection (SSSOD) constructs segmentation ability of attractive objects from surroundings under the supervision of sparse scribble labels. For the better segmentation, depth and thermal infrared modalities serve as the supplement to RGB images in the complex scenes. Existing methods specifically design various feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image input respectively, leading to similar model flood. As the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and prompt interactive capability, we propose an SSSOD family based on SAM, named <italic>SSFam</i>, for the combination input with different modalities. Firstly, different modal-aware modulators are designed to attain modal-specific knowledge which cooperates with modal-agnostic information extracted from the frozen SAM encoder for the better feature ensemble. Secondly, a siamese decoder is tailored to bridge the gap between the training with scribble prompt and the testing with no prompt for the stronger decoding ability. Our model demonstrates the remarkable performance among combinations of different modalities and refreshes the highest level of scribble supervised methods and comes close to the ones of fully supervised methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1988-2000"},"PeriodicalIF":8.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Debiased Mapping for Full-Reference Image Quality Assessment","authors":"Baoliang Chen;Hanwei Zhu;Lingyu Zhu;Shanshe Wang;Jingshan Pan;Shiqi Wang","doi":"10.1109/TMM.2025.3535280","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535280","url":null,"abstract":"An ideal full-reference image quality (FR-IQA) model should exhibit both high separability for images with different quality and compactness for images with the same or indistinguishable quality. However, existing learning-based FR-IQA models that directly compare images in deep-feature space, usually overly emphasize the quality separability, neglecting to maintain the compactness when images are of similar quality. In our work, we identify that the perception bias mainly stems from an inappropriate subspace where images are projected and compared. For this issue, we propose a Debiased Mapping based quality Measure (DMM), leveraging orthonormal bases formed by singular value decomposition (SVD) in the deep features domain. The SVD effectively decomposes the quality variations into singular values and mapping bases, enabling quality inference with more reliable feature difference measures. Extensive experimental results reveal that our proposed measure could mitigate the perception bias effectively and demonstrates excellent quality prediction performance on various IQA datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2638-2649"},"PeriodicalIF":8.4,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised Video Grounding With Multimodal Large Language Model","authors":"Guozhang Li;Xinpeng Ding;De Cheng;Jie Li;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521758","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521758","url":null,"abstract":"Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose <bold>ETC</b> (<bold>E</b>xpand <bold>t</b>hen <bold>C</b>larify), first using the additional information to expand the initial incomplete pseudo-temporal boundaries, and subsequently refining these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise in expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1772-1782"},"PeriodicalIF":8.4,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Neural Adaptive Wireless Video Streaming via Cross-Layer Information Exposure and Online Tuning","authors":"Lingzhi Zhao;Ying Cui;Yuhang Jia;Yunfei Zhang;Klara Nahrstedt","doi":"10.1109/TMM.2024.3521820","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521820","url":null,"abstract":"Deep reinforcement learning (DRL) demonstrates its promising potential in adaptive video streaming and has recently received increasing attention. However, existing DRL-based methods for adaptive video streaming mainly use application (APP) layer information, adopt heuristic training methods, and are not robust against continuous network fluctuations. This paper aims to boost the quality of experience (QoE) of adaptive wireless video streaming by using cross-layer information, deriving a rigorous training method, and adopting effective online tuning methods with real-time data. First, we formulate a more comprehensive and accurate adaptive wireless video streaming problem as an infinite stage discounted Markov decision process (MDP) problem by additionally incorporating past and lower-layer information. This formulation allows a flexible tradeoff between QoE and computational and memory costs for solving the problem. In the offline scenario (only with pre-collected data), we propose an enhanced asynchronous advantage actor-critic (eA3C) method by jointly optimizing the parameters of parameterized policy and value function. Specifically, we build an eA3C network consisting of a policy network and a value network that can utilize cross-layer, past, and current information and jointly train the eA3C network using pre-collected samples. In the online scenario (with additional real-time data), we propose two continual learning-based online tuning methods for designing better policies for a specific user with different QoE and training time tradeoffs. The proposed online tuning methods are robust against continuous network fluctuations and more general and flexible than the existing online tuning methods. Finally, experimental results show that the proposed offline policy can improve the QoE by 6.8% to 14.4% compared to the state-of-the-arts in the offline scenario, and the proposed online policies can achieve <inline-formula><tex-math>$6.3%$</tex-math></inline-formula> to 55.8% gains in QoE over the state-of-the-arts in the online scenario.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1289-1304"},"PeriodicalIF":8.4,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Residual-Guided Interactive Learning for the Quality Assessment of Enhanced Images","authors":"Shishun Tian;Tiantian Zeng;Zhengyu Zhang;Wenbin Zou;Xia Li","doi":"10.1109/TMM.2024.3521734","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521734","url":null,"abstract":"Image enhancement algorithms can facilitate computer vision tasks in real applications. However, various distortions may also be introduced by image enhancement algorithms. Therefore, the image quality assessment (IQA) plays a crucial role in accurately evaluating enhanced images to provide dependable feedback. Current enhanced IQA methods are mainly designed for single specific scenarios, resulting in limited performance in other scenarios. Besides, no-reference methods predict quality utilizing enhanced images alone, which ignores the existing degraded images that contain valuable information, are not reliable enough. In this work, we propose a degraded-reference image quality assessment method based on dual residual-guided interactive learning (DRGQA) for the enhanced images in multiple scenarios. Specifically, a global and local feature collaboration module (GLCM) is proposed to imitate the perception of observers to capture comprehensive quality-aware features by using convolutional neural networks (CNN) and Transformers in an interactive manner. Then, we investigate the structure damage and color shift distortions that commonly occur in the enhanced images and propose a dual residual-guided module (DRGM) to make the model concentrate on the distorted regions that are sensitive to human visual system (HVS). Furthermore, a distortion-aware feature enhancement module (DEM) is proposed to improve the representation abilities of features in deeper networks. Extensive experimental results demonstrate that our proposed DRGQA achieves superior performance with lower computational complexity compared to the state-of-the-art IQA methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1637-1651"},"PeriodicalIF":8.4,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learned Focused Plenoptic Image Compression With Local-Global Correlation Learning","authors":"Gaosheng Liu;Huanjing Yue;Bihan Wen;Jingyu Yang","doi":"10.1109/TMM.2024.3521815","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521815","url":null,"abstract":"The dense light field sampling of focused plenoptic images (FPIs) yields substantial amounts of redundant data, necessitating efficient compression in practical applications. However, the presence of discontinuous structures and long-distance properties in FPIs poses a challenge. In this paper, we propose a novel end-to-end approach for learned focused plenoptic image compression (LFPIC). Specifically, we introduce a local-global correlation learning strategy to build the nonlinear transforms. This strategy can effectively handle the discontinuous structures and leverage long-distance correlations in FPI for high compression efficiency. Additionally, we propose a spatial-wise context model tailored for LFPIC to help emphasize the most related symbols during coding and further enhance the rate-distortion performance. Experimental results demonstrate the effectiveness of our proposed method, achieving a 22.16% BD-rate reduction (measured in PSNR) on the public dataset compared to the recent state-of-the-art LFPIC method. This improvement holds significant promise for benefiting the applications of focused plenoptic cameras.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1216-1227"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Context Measurement Network for Single Hyperspectral Image Super-Resolution","authors":"Heng Wang;Cong Wang;Yuan Yuan","doi":"10.1109/TMM.2025.3535371","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535371","url":null,"abstract":"Single hyperspectral image super-resolution aims to enhance the spatial resolution of a hyperspectral image without relying on any auxiliary information. Despite the abundant spectral information, the inherent high-dimensionality in hyperspectral images still remains a challenge for memory efficiency. Recently, recursion-based methods have been proposed to reduce memory requirements. However, these methods utilize the reconstruction features as feedback embedding to explore context information, leading to sub-optimal performance as they ignore the complementarity of different hierarchical levels of information in the context. Additionally, existing methods equivalently compensate the previous feedback information to the current band, resulting in an indistinct and untargeted introduction of the context. In this paper, we propose a hierarchical context measurement network to construct corresponding measurement strategies for different hierarchical information, capturing comprehensive and powerful complementary knowledge from the context. Specifically, a feature-wise similarity measurement module is designed to calculate global cross-layer relationships between the middle features of the current band and those of the context, so as to explore the embedded middle features discriminatively through generated global dependencies. Furthermore, considering the pixel-wise correspondence between the reconstruction features and the super-resolved results, we propose a pixel-wise similarity measurement module for the complementary reconstruction features embedding, exploring detailed complementary information within the embedded reconstruction features by dynamically generating a spatially adaptive filter for each pixel. Experimental results reported on three benchmark hyperspectral datasets reveal that the proposed method outperforms other state-of-the-art peers in both visual and metric evaluations.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2623-2637"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Listen With Seeing: Cross-Modal Contrastive Learning for Audio-Visual Event Localization","authors":"Chao Sun;Min Chen;Chuanbo Zhu;Sheng Zhang;Ping Lu;Jincai Chen","doi":"10.1109/TMM.2025.3535359","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535359","url":null,"abstract":"In real-world physiological and psychological scenarios, there often exists a robust complementary correlation between audio and visual signals. Audio-Visual Event Localization (AVEL) aims to identify segments with Audio-Visual Events (AVEs) that contain both audio and visual tracks in unconstrained videos. Prior studies have predominantly focused on audio-visual cross-modal fusion methods, overlooking the fine-grained exploration of the cross-modal information fusion mechanism. Moreover, due to the inherent heterogeneity of multi-modal data, inevitable new noise is introduced during the audio-visual fusion process. To address these challenges, we propose a novel Cross-modal Contrastive Learning Network (CCLN) for AVEL, comprising a backbone network and a branch network. In the backbone network, drawing inspiration from physiological theories of sensory integration, we elucidate the process of audio-visual information fusion, interaction, and integration from an information-flow perspective. Notably, the Self-constrained Bi-modal Interaction (SBI) module is a bi-modal attention structure integrated with audio-visual fusion information, and through gated processing of the audio-visual correlation matrix, it effectively captures inter-modal correlation. The Foreground Event Enhancement (FEE) module emphasizes the significance of event-level boundaries by elongating the distance between scene events during training through adaptive weights. Furthermore, we introduce weak video-level labels to constrain the cross-modal semantic alignment of audio-visual events and design a weakly supervised cross-modal contrastive learning loss (WCCL Loss) function, which enhances the quality of fusion representation in the dual-branch contrastive learning framework. Extensive experiments conducted on the AVE dataset for both fully supervised and weakly supervised event localization, as well as Cross-Modal Localization (CML) tasks, demonstrate the superior performance of our model compared to state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2650-2665"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Visual Object Tracking Through Visual Prompting","authors":"Shih-Fang Chen;Jun-Cheng Chen;I-Hong Jhuo;Yen-Yu Lin","doi":"10.1109/TMM.2025.3535323","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535323","url":null,"abstract":"Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method can suppress distracting objects and enhance the tracker.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2682-2694"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}