{"title":"ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised Video Grounding With Multimodal Large Language Model","authors":"Guozhang Li;Xinpeng Ding;De Cheng;Jie Li;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521758","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521758","url":null,"abstract":"Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose <bold>ETC</b> (<bold>E</b>xpand <bold>t</b>hen <bold>C</b>larify), first using the additional information to expand the initial incomplete pseudo-temporal boundaries, and subsequently refining these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise in expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1772-1782"},"PeriodicalIF":8.4,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QRNet: Quaternion-Based Refinement Network for Surface Normal Estimation","authors":"Hanlin Bai;Xin Gao;Wei Deng;Jianwang Gan;Yijin Xiong;Kangkang Kou;Guoying Zhang","doi":"10.1109/TMM.2025.3535299","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535299","url":null,"abstract":"In recent years, there has been a notable increase in interest in image-based surface normal estimation. These approaches are capable of predicting the surface normal of real scenes using only an image, thereby facilitating a more profound comprehension of the actual scene and providing assistance with other perceptual tasks. However, dense regression predictions are susceptible to misdirection when encountering intricate details, which presents a paradoxical challenge for image-based surface normal estimation in reconciling detail and density. By introducing quaternion rotations as fusion module with geometric property, we propose a quaternion-based refined network structure that fuses detailed and structural information. Specifically, we design a high-resolution surface normal baseline with a streamlined structure, to extract fine-grained features while reducing the angular error in surface normal regression values caused by downsampling. Additionally, we propose a subtle angle loss function that prevents subtle changes from being overlooked without extra information, further enhancing the model's ability to learn detailed information. The proposed method demonstrates state-of-the-art performance compared to existing techniques on three real-world datasets comprising indoor and outdoor scenes. The results demonstrate the robust effectiveness of our deep learning approach that incorporates geometric prior guidance, highlighting improved robustness in applying deep learning methods. The source code will be released upon acceptance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3356-3369"},"PeriodicalIF":8.4,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Neural Adaptive Wireless Video Streaming via Cross-Layer Information Exposure and Online Tuning","authors":"Lingzhi Zhao;Ying Cui;Yuhang Jia;Yunfei Zhang;Klara Nahrstedt","doi":"10.1109/TMM.2024.3521820","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521820","url":null,"abstract":"Deep reinforcement learning (DRL) demonstrates its promising potential in adaptive video streaming and has recently received increasing attention. However, existing DRL-based methods for adaptive video streaming mainly use application (APP) layer information, adopt heuristic training methods, and are not robust against continuous network fluctuations. This paper aims to boost the quality of experience (QoE) of adaptive wireless video streaming by using cross-layer information, deriving a rigorous training method, and adopting effective online tuning methods with real-time data. First, we formulate a more comprehensive and accurate adaptive wireless video streaming problem as an infinite stage discounted Markov decision process (MDP) problem by additionally incorporating past and lower-layer information. This formulation allows a flexible tradeoff between QoE and computational and memory costs for solving the problem. In the offline scenario (only with pre-collected data), we propose an enhanced asynchronous advantage actor-critic (eA3C) method by jointly optimizing the parameters of parameterized policy and value function. Specifically, we build an eA3C network consisting of a policy network and a value network that can utilize cross-layer, past, and current information and jointly train the eA3C network using pre-collected samples. In the online scenario (with additional real-time data), we propose two continual learning-based online tuning methods for designing better policies for a specific user with different QoE and training time tradeoffs. The proposed online tuning methods are robust against continuous network fluctuations and more general and flexible than the existing online tuning methods. Finally, experimental results show that the proposed offline policy can improve the QoE by 6.8% to 14.4% compared to the state-of-the-arts in the offline scenario, and the proposed online policies can achieve <inline-formula><tex-math>$6.3%$</tex-math></inline-formula> to 55.8% gains in QoE over the state-of-the-arts in the online scenario.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1289-1304"},"PeriodicalIF":8.4,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Residual-Guided Interactive Learning for the Quality Assessment of Enhanced Images","authors":"Shishun Tian;Tiantian Zeng;Zhengyu Zhang;Wenbin Zou;Xia Li","doi":"10.1109/TMM.2024.3521734","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521734","url":null,"abstract":"Image enhancement algorithms can facilitate computer vision tasks in real applications. However, various distortions may also be introduced by image enhancement algorithms. Therefore, the image quality assessment (IQA) plays a crucial role in accurately evaluating enhanced images to provide dependable feedback. Current enhanced IQA methods are mainly designed for single specific scenarios, resulting in limited performance in other scenarios. Besides, no-reference methods predict quality utilizing enhanced images alone, which ignores the existing degraded images that contain valuable information, are not reliable enough. In this work, we propose a degraded-reference image quality assessment method based on dual residual-guided interactive learning (DRGQA) for the enhanced images in multiple scenarios. Specifically, a global and local feature collaboration module (GLCM) is proposed to imitate the perception of observers to capture comprehensive quality-aware features by using convolutional neural networks (CNN) and Transformers in an interactive manner. Then, we investigate the structure damage and color shift distortions that commonly occur in the enhanced images and propose a dual residual-guided module (DRGM) to make the model concentrate on the distorted regions that are sensitive to human visual system (HVS). Furthermore, a distortion-aware feature enhancement module (DEM) is proposed to improve the representation abilities of features in deeper networks. Extensive experimental results demonstrate that our proposed DRGQA achieves superior performance with lower computational complexity compared to the state-of-the-art IQA methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1637-1651"},"PeriodicalIF":8.4,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Air Pollution Monitoring by Integrating Local and Global Information in Self-Adaptive Multiscale Transform Domain","authors":"Ke Gu;Yuchen Liu;Hongyan Liu;Bo Liu;Junfei Qiao;Weisi Lin;Wenjun Zhang","doi":"10.1109/TMM.2025.3535351","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535351","url":null,"abstract":"This paper proposed a novel image-based air pollution monitor (IAPM) by incorporating local and global information in the self-adaptive multiscale transform domain, so as to achieve the timely and effective leakage detection of typical air pollutants from a single image. To be specific, this paper first developed a screen-shaped module according to two significant findings in visual neuroscience, which include the high sensitivity of human eyes to horizontal and vertical stimuli and the center-surround inhibition, by designing and fusing the square module, horizontal strip module and vertical strip module parallelly for simulating the behaviour of human eyes to extract local features. Second, the learnable weights and proportional mapping were applied to incorporate the screen-shaped module and lightweight vision transformer as backbone, towards more richly exploiting and fusing local and global information just as the way a brain perceives external stimuli. Third, a new self-adaptive multiscale transform domain method was devised based on two motivations from the visual characteristics of multiscale perception and the brain characteristics of self-adaptive domain transform to modify the backbone by using the operations of pooling and pointwise convolution. Extensive experiments implemented on the datasets of carbon particulate matters and ethylene leakage confirmed the superior monitoring performance of the proposed IAPM model beyond the state-of-the-art (SOTA) peers by an accuracy gain of about 4%. Furthermore, the proposed IAPM model only required 0.089 GFLOPs and 0.15 million model parameters, remarkably outperforming SOTA competitors in computational efficiency and storage resources.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3716-3728"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-Preserving Image Inpainting Using Markov Random Field Modeling","authors":"Ping Kong;An Li;Daidou Guo;Liang Zhou;Chuan Qin;Xinpeng Zhang","doi":"10.1109/TMM.2025.3535382","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535382","url":null,"abstract":"Cloud services have attracted extensive attention due to low cost, agility and mobility. However, when processing data on cloud servers, users may worry about semi-honest third parties stealing private information from them, hence, data encryption is applied for privacy protection. Inpainting is a technique that reconstructs certain undesirable regions in an image through an imperceptible manner, which can be accomplished by searching for well-matching candidate patches and copying them to to-be-inpainted locations. However, when the image is encrypted, the matched candidate patch searching is a challenging dilemma. Therefore, tackling these data-privacy issues for image inpainting over a cloud infrastructure, we propose an image inpainting scheme using Markov random field (MRF) modeling in encrypted domain. In this scheme, the sender encrypts the to-be-inapinted image by using a homomorphic cryptosystem that supports homomorphic ciphertext comparison. Then, the cloud realizes the MRF-based inpainting for encrypted images through some specific homomorphic operations. In addition, secure context descriptors are utilized to improve the inpainting of textures and structures. Finally, the receiver obtains the inpainted result through image decryption. The proposed scheme is proved to be secure through various cryptographic attacks. Qualitative and quantitative results demonstrate our scheme achieves better inpainted results in structure compared with state-of-the-art schemes in encrypted domain.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3688-3701"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Gradient Equalization and Feature Diversification for Long-Tailed Multi-Label Image Recognition","authors":"Zhao-Min Chen;Quan Cui;Xiaoqin Zhang;Ruoxi Deng;Chaoqun Xia;Shijian Lu","doi":"10.1109/TMM.2025.3535395","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535395","url":null,"abstract":"Multi-label image recognition with convolutional neural networks has achieved remarkable progress in the past few years. However, most existing multi-label image recognition methods suffer from the long-tailed data distribution problem, <italic>i.e.</i>, head categories occupy most training samples, while tailed classes have few samples. This work firstly studies the influence of long-tailed data distribution on existing multi-label image recognition methods. Based on this, two crucial issues of the existing methods are identified: 1) severe gradient imbalance between head and tailed categories, even though re-balancing strategies are adopted; 2) the lack of diversity of tail category training samples. To tackle the first issue, this paper proposes a group sampling strategy to create group-wise balanced data distribution. Meanwhile, a dynamic gradient balancing loss is proposed to equalize the gradient for all categories. To tackle the second issue, this paper proposes a diversity enhancement module to fuse the information across all categories, preventing the network from overfitting tail classes. Furthermore, it also balances the gradient, promoting the discriminability of learned classifiers. Our method significantly outperforms the baseline method and achieves competitive performance with state-of-the-art methods on VOC-LT and COCO-LT datasets. Extensive ablation studies are conducted to verify the effectiveness of the essential proposals.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3489-3500"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IMU-Assisted Gray Pixel Shift for Video White Balance Stabilization","authors":"Lei Zhang;Xin Chen;Zichen Wang","doi":"10.1109/TMM.2025.3535396","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535396","url":null,"abstract":"Video white balance is to correct the scene color of video frames to the color under the standard white illumination. Due to the camera movement, video white balance usually suffers temporal instability with unnatural color change between frames. This paper presents a video white balance stabilization method for spatially correct and temporally stable color correction. It exploits the color invariance at the position of the same object to obtain the consistent illumination color estimation through frames. Specifically, it detects gray pixels that inherit the potential illumination color, and their inter-frame motion calculated with the assistance of inertial measurement unit (IMU) is used to carry gray pixels for establishing their correspondence and color fusion between adjacent frames. Because the IMU has more robust and accurate motion cues against large camera movement and texture-less regions in the scene, our method can generate better gray pixel correspondences and illumination color estimation for the white balance stabilization. Besides, our method is computationally efficient to be deployed on mobile phones. Experimental results show that our method can significantly improve the temporal stability as well as maintain the spatial correctness of white balance for videos recorded by cameras equipped with IMU sensors.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3664-3676"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ATM-NeRF: Accelerating Training for NeRF Rendering on Mobile Devices via Geometric Regularization","authors":"Yang Chen;Lin Zhang;Shengjie Zhao;Yicong Zhou","doi":"10.1109/TMM.2025.3535288","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535288","url":null,"abstract":"Recently, an increasing number of researchers have been dedicated to transferring the impressive novel view synthesis capability of Neural Radiance Fields (NeRF) to resource-constrained mobile devices. One common solution is to pre-train NeRF and bake it into textured meshes which are well supported by mobile graphics hardware. However, the training process of existing methods often requires several hours even with multiple high-end NVIDIA V100 GPUs. The underlying reason is that these schemes mainly rely on photometric rendering loss, neglecting the geometric relationship between the pre-trained NeRF and the baked results. Standing on this point, we present <bold>ATM-NeRF</b> (<bold>A</b>ccelerating <bold>T</b>raining for <bold>M</b>obile rendering based on <bold>NeRF</b>), which is the first to apply effective geometric regularization constraints during both the pre-training and the baking training stages for faster convergence. Specifically, in the initial NeRF pre-training stage, we enforce consistency of the multi-resolution density grids representing the scene geometry to mitigate the shape-radiance ambiguity problem to some extent, achieving a coarse mesh with smoothness. In the second stage, we utilize the positions and geometric features of 3D points projected from the pre-trained posed depths to provide geometric supervision for joint refinement of geometry and appearance of the coarse mesh. As a result, our ATM-NeRF achieves comparable rendering quality to MobileNeRF with a training speed that is about <inline-formula><tex-math>$30times sim 70times$</tex-math></inline-formula> faster while maintaining finer structure details of the exported mesh.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3279-3293"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpliceMix: A Cross-Scale and Semantic Blending Augmentation Strategy for Multi-Label Image Classification","authors":"Lei Wang;Yibing Zhan;Leilei Ma;Dapeng Tao;Liang Ding;Chen Gong","doi":"10.1109/TMM.2025.3535387","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535387","url":null,"abstract":"Recently, Mix-style data augmentation methods (<italic>e.g</i>., Mixup and CutMix) have shown promising performance in various visual tasks. However, these methods are primarily designed for single-label images, ignoring the considerable discrepancies between single- and multi-label images, <italic>i.e</i>., a multi-label image involves multiple co-occurred categories and fickle object scales. On the other hand, previous multi-label image classification (MLIC) methods tend to design elaborate models, bringing expensive computation. In this article, we introduce a simple but effective augmentation strategy for multi-label image classification, namely SpliceMix. The “splice” in our method is two-fold: <italic>1)</i> Each mixed image is a splice of several downsampled images in the form of a grid, where the semantics of images attending to mixing are blended without object deficiencies for alleviating co-occurred bias; <italic>2)</i> We splice mixed images and the original mini-batch to form a new SpliceMixed mini-batch, which allows an image with different scales to contribute to training together. Furthermore, such splice in our SpliceMixed mini-batch enables interactions between mixed images and original regular images. We also provide a simple and non-parametric extension based on consistency learning (SpliceMix-CL) to show the potential of extending our SpliceMix. Extensive experiments on various tasks demonstrate that only using SpliceMix with a baseline model (<italic>e.g</i>., ResNet) achieves better performance than state-of-the-art methods. Moreover, the generalizability of our SpliceMix is further validated by the improvements in current MLIC methods when married with our SpliceMix.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3251-3265"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144281302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}