{"title":"Frequency-Guided Spatial Adaptation for Camouflaged Object Detection","authors":"Shizhou Zhang;Dexuan Kong;Yinghui Xing;Yue Lu;Lingyan Ran;Guoqiang Liang;Hexu Wang;Yanning Zhang","doi":"10.1109/TMM.2024.3521681","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521681","url":null,"abstract":"Camouflaged object detection (COD) aims to segment camouflaged objects which exhibit very similar patterns with the surrounding environment. Recent research works have shown that enhancing the feature representation via the frequency information can greatly alleviate the ambiguity problem between the foreground objects and the background. With the emergence of vision foundation models, like InternImage, Segment Anything Model etc, adapting the pretrained model on COD tasks with a lightweight adapter module shows a novel and promising research direction. Existing adapter modules mainly care about the feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for COD task. Specifically, we transform the input features of the adapter into frequency domain. By grouping and interacting with frequency components located within non overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, making the intensity of image details and contour features adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets and the proposed method outperforms 26 state-of-the-art methods with large margins. Code will be released.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"72-83"},"PeriodicalIF":8.4,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Scatter Sparse Dictionary Pair Learning for Cross-Domain Classification","authors":"Lin Jiang;Jigang Wu;Shuping Zhao;Jiaxing Li","doi":"10.1109/TMM.2024.3521731","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521731","url":null,"abstract":"In cross-domain recognition tasks, the divergent distributions of data acquired from various domains degrade the effectiveness of knowledge transfer. Additionally, in practice, cross-domain data also contain a massive amount of redundant information, usually disturbing the training processes of cross-domain classifiers. Seeking to address these issues and obtain efficient domain-invariant knowledge, this paper proposes a novel cross-domain classification method, named cross-scatter sparse dictionary pair learning (CSSDL). Firstly, a pair of dictionaries is learned in a common subspace, in which the marginal distribution divergence between the cross-domain data is mitigated, and domain-invariant information can be efficiently extracted. Then, a cross-scatter discriminant term is proposed to decrease the distance between cross-domain data belonging to the same class. As such, this term guarantees that the data derived from same class can be aligned and that the conditional distribution divergence is mitigated. In addition, a flexible label regression method is introduced to match the feature representation and label information in the label space. Thereafter, a discriminative and transferable feature representation can be obtained. Moreover, two sparse constraints are introduced to maintain the sparse characteristics of the feature representation. Extensive experimental results obtained on public datasets demonstrate the effectiveness of the proposed CSSDL approach.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"371-384"},"PeriodicalIF":8.4,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Masked Video Pretraining Advances Real-World Video Denoising","authors":"Yi Jin;Xiaoxiao Ma;Rui Zhang;Huaian Chen;Yuxuan Gu;Pengyang Ling;Enhong Chen","doi":"10.1109/TMM.2024.3521818","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521818","url":null,"abstract":"Learning-based video denoisers have attained state-of-the-art (SOTA) performances on public evaluation benchmarks. Nevertheless, they typically encounter significant performance drops when applied to unseen real-world data, owing to inherent data discrepancies. To address this problem, this work delves into the model pretraining techniques and proposes masked central frame modeling (MCFM), a new video pretraining approach that significantly improves the generalization ability of the denoiser. This proposal stems from a key observation: pretraining denoiser by reconstructing intact videos from the corrupted sequences, where the central frames are masked at a suitable probability, contributes to achieving superior performance on real-world data. Building upon MCFM, we introduce a robust video denoiser, named MVDenoiser, which is firstly pretrained on massive available ordinary videos for general video modeling, and then finetuned on costful real-world noisy/clean video pairs for noisy-to-clean mapping. Additionally, beyond the denoising model, we further establish a new paired real-world noisy video dataset (RNVD) to facilitate cross-dataset evaluation of generalization ability. Extensive experiments conducted across different datasets demonstrate that the proposed method achieves superior performance compared to existing methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"622-636"},"PeriodicalIF":8.4,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Point Patches Contrastive Learning for Enhanced Point Cloud Completion","authors":"Ben Fei;Liwen Liu;Tianyue Luo;Weidong Yang;Lipeng Ma;Zhijun Li;Wen-Ming Chen","doi":"10.1109/TMM.2024.3521854","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521854","url":null,"abstract":"In partial-to-complete point cloud completion, it is imperative that enabling every patch in the output point cloud faithfully represents the corresponding patch in partial input, ensuring similarity in terms of geometric content. To achieve this objective, we propose a straightforward method dubbed PPCL that aims to maximize the mutual information between two point patches from the encoder and decoder by leveraging a contrastive learning framework. Contrastive learning facilitates the mapping of two similar point patches to corresponding points in a learned feature space. Notably, we explore multi-layer point patches contrastive learning (MPPCL) instead of operating on the whole point cloud. The negatives are exploited within the input point cloud itself rather than the rest of the datasets. To fully leverage the local geometries present in the partial inputs and enhance the quality of point patches in the encoder, we introduce Multi-level Feature Learning (MFL) and Hierarchical Feature Fusion (HFF) modules. These modules are also able to facilitate the learning of various levels of features. Moreover, Spatial-Channel Transformer Point Up-sampling (SCT) is devised to guide the decoder to construct a complete and fine-grained point cloud by leveraging enhanced point patches from our point patches contrastive learning. Extensive experiments demonstrate that our PPCL can achieve better quantitive and qualitative performance over off-the-shelf methods across various datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"581-596"},"PeriodicalIF":8.4,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization","authors":"Yunlong Tang;Yuxuan Wan;Lei Qi;Xin Geng","doi":"10.1109/TMM.2024.3521671","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521671","url":null,"abstract":"Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily bulids upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from features that contain both semantic and style information after the encoder, are directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, responsible for generating style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"120-132"},"PeriodicalIF":8.4,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"List of Reviewers","authors":"","doi":"10.1109/TMM.2024.3501532","DOIUrl":"https://doi.org/10.1109/TMM.2024.3501532","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11428-11439"},"PeriodicalIF":8.4,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10823085","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Semantic Reconstruction Network for Weakly Supervised Temporal Sentence Grounding","authors":"Kefan Tang;Lihuo He;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521676","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521676","url":null,"abstract":"Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"95-107"},"PeriodicalIF":8.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SOFW: A Synergistic Optimization Framework for Indoor 3D Object Detection","authors":"Kun Dai;Zhiqiang Jiang;Tao Xie;Ke Wang;Dedong Liu;Zhendong Fan;Ruifeng Li;Lijun Zhao;Mohamed Omar","doi":"10.1109/TMM.2024.3521782","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521782","url":null,"abstract":"In this work, we observe that indoor 3D object detection across varied scene domains encompasses both universal attributes and specific features. Based on this insight, we propose SOFW, a synergistic optimization framework that investigates the feasibility of optimizing 3D object detection tasks concurrently spanning several dataset domains. The core of SOFW is identifying domain-shared parameters to encode universal scene attributes, while employing domain-specific parameters to delve into the particularities of each scene domain. Technically, we introduce a set abstraction alteration strategy (SAAS) that embeds learnable domain-specific features into set abstraction layers, thus empowering the network with a refined comprehension for each scene domain. Besides, we develop an element-wise sharing strategy (ESS) to facilitate fine-grained adaptive discernment between domain-shared and domain-specific parameters for network layers. Benefited from the proposed techniques, SOFW crafts feature representations for each scene domain by learning domain-specific parameters, whilst encoding generic attributes and contextual interdependencies via domain-shared parameters. Built upon the classical detection framework VoteNet without any complicated modules, SOFW delivers impressive performances under multiple benchmarks with much fewer total storage footprint. Additionally, we demonstrate that the proposed ESS is a universal strategy and applying it to a voxels-based approach TR3D can realize cutting-edge detection accuracy on all S3DIS, ScanNet, and SUN RGB-D datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"637-651"},"PeriodicalIF":8.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disaggregation Distillation for Person Search","authors":"Yizhen Jia;Rong Quan;Haiyan Chen;Jiamei Liu;Yichao Yan;Song Bai;Jie Qin","doi":"10.1109/TMM.2024.3521732","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521732","url":null,"abstract":"Person search is a challenging task in computer vision and multimedia understanding, which aims at localizing and identifying target individuals in realistic scenes. State-of-the-art models achieve remarkable success but suffer from overloaded computation and inefficient inference, making them impractical in most real-world applications. A promising approach to tackle this dilemma is to compress person search models with knowledge distillation (KD). Previous KD-based person search methods typically distill the knowledge from the re-identification (re-id) branch, completely overlooking the useful knowledge from the detection branch. In addition, we elucidate that the imbalance between person and background regions in feature maps has a negative impact on the distillation process. To this end, we propose a novel KD-based approach, namely Disaggregation Distillation for Person Search (DDPS), which disaggregates the distillation process and feature maps, respectively. Firstly, the distillation process is disaggregated into two task-oriented sub-processes, <italic>i.e.</i>, detection distillation and re-id distillation, to help the student learn both accurate localization capability and discriminative person embeddings. Secondly, we disaggregate each feature map into person and background regions, and distill these two regions independently to alleviate the imbalance problem. More concretely, three types of distillation modules, <italic>i.e.</i>, logit distillation (LD), correlation distillation (CD), and disaggregation feature distillation (DFD), are particularly designed to transfer comprehensive information from the teacher to the student. Note that such a simple yet effective distillation scheme can be readily applied to both homogeneous and heterogeneous teacher-student combinations. We conduct extensive experiments on two person search benchmarks, where the results demonstrate that, surprisingly, our DDPS enables the student model to surpass the performance of the corresponding teacher model, even achieving comparable results with general person search models.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"158-170"},"PeriodicalIF":8.4,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}