IEEE Transactions on Multimedia: Latest Publications

Video Instance Segmentation Without Using Mask and Identity Supervision
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-25 · DOI: 10.1109/TMM.2024.3521668
Ge Li;Jiale Cao;Hanqing Sun;Rao Muhammad Anwer;Jin Xie;Fahad Khan;Yanwei Pang
{"title":"Video Instance Segmentation Without Using Mask and Identity Supervision","authors":"Ge Li;Jiale Cao;Hanqing Sun;Rao Muhammad Anwer;Jin Xie;Fahad Khan;Yanwei Pang","doi":"10.1109/TMM.2024.3521668","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521668","url":null,"abstract":"Video instance segmentation (VIS) is a challenging vision problem in which the task is to simultaneously detect, segment, and track all the object instances in a video. Most existing VIS approaches rely on pixel-level mask supervision within a frame as well as instance-level identity annotation across frames. However, obtaining these ‘mask and identity’ annotations is time-consuming and expensive. We propose the first mask-identity-free VIS framework that neither utilizes mask annotations nor requires identity supervision. Accordingly, we introduce a query contrast and exchange network (QCEN) comprising instance query contrast and query-exchanged mask learning. The instance query contrast first performs cross-frame instance matching and then conducts query feature contrastive learning. The query-exchanged mask learning exploits both intra-video and inter-video query exchange properties: exchanging queries of an identical instance from different frames within a video results in consistent instance masks, whereas exchanging queries across videos results in all-zero background masks. Extensive experiments on three benchmarks (YouTube-VIS 2019, YouTube-VIS 2021, and OVIS) reveal the merits of the proposed approach, which significantly reduces the performance gap between the identify-free baseline and our mask-identify-free VIS method. On the YouTube-VIS 2019 validation set, our mask-identity-free approach achieves 91.4% of the stronger-supervision-based baseline performance when utilizing the same ImageNet pre-trained model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"224-235"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Perspective Pseudo-Label Generation and Confidence-Weighted Training for Semi-Supervised Semantic Segmentation
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-25 · DOI: 10.1109/TMM.2024.3521801
Kai Hu;Xiaobo Chen;Zhineng Chen;Yuan Zhang;Xieping Gao
{"title":"Multi-Perspective Pseudo-Label Generation and Confidence-Weighted Training for Semi-Supervised Semantic Segmentation","authors":"Kai Hu;Xiaobo Chen;Zhineng Chen;Yuan Zhang;Xieping Gao","doi":"10.1109/TMM.2024.3521801","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521801","url":null,"abstract":"Self-training has been shown to achieve remarkable gains in semi-supervised semantic segmentation by creating pseudo-labels using unlabeled data. This approach, however, suffers from the quality of the generated pseudo-labels, and generating higher quality pseudo-labels is the main challenge that needs to be addressed. In this paper, we propose a novel method for semi-supervised semantic segmentation based on Multi-perspective pseudo-label Generation and Confidence-weighted Training (MGCT). First, we present a multi-perspective pseudo-label generation strategy that considers both global and local semantic perspectives. This strategy prioritizes pixels in all images by the global and local predictions, and subsequently generates pseudo-labels for different pixels in stages according to the ranking results. Our pseudo-label generation method shows superior suitability for semi-supervised semantic segmentation compared to other approaches. Second, we propose a confidence-weighted training method to alleviate performance degradation caused by unstable pixels. Our training method assigns confident weights to unstable pixels, which reduces the interference of unstable pixels during training and facilitates the efficient training of the model. Finally, we validate our approach on the PASCAL VOC 2012 and Cityscapes datasets, and the results indicate that we achieve new state-of-the-art performance on both datasets in all settings.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"300-311"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CLIP-Based Modality Compensation for Visible-Infrared Image Re-Identification
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-25 · DOI: 10.1109/TMM.2024.3521764
Gang Hu;Yafei Lv;Jianting Zhang;Qian Wu;Zaidao Wen
{"title":"CLIP-Based Modality Compensation for Visible-Infrared Image Re-Identification","authors":"Gang Hu;Yafei Lv;Jianting Zhang;Qian Wu;Zaidao Wen","doi":"10.1109/TMM.2024.3521764","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521764","url":null,"abstract":"Visible-infrared image re-identification (VIReID) aims to match objects with the same identity appearing across different modalities. Given the significant differences between visible and infrared images, VIReID poses a formidable challenge. Most existing methods focus on extracting modality-shared features while ignore modality-specific features, which often also contain crucial important discriminative information. In addition, high-level semantic information of the objects, such as shape and appearance, is also crucial for the VIReID task. To further enhance the retrieval performance, we propose a novel one-stage CLIP-based Modality Compensation (CLIP-MC) method for the VIReID task. Our method introduces a new prompt learning paradigm that leverages the semantic understanding capabilities of CLIP to recover missing modality information. CLIP-MC comprises three key modules: Instance Text Prompt Generation (ITPG), Modality Compensation (MC), and Modality Context Learner (MCL). Specifically, the ITPG module facilitates effective alignment and interaction between image tokens and text tokens, enhancing the text encoder's ability to capture detailed visual information from the images. This ensures that the text encoder generates fine-grained descriptions of the images. The MCL module captures the unique information of each modality and generates modality-specific context tokens, which are more flexible compared to fixed text descriptions. Guided by the modality-specific context, the text encoder discovers missing modality information from the images and produces compensated modality features. Finally, the MC module combines the original and compensated modality features to obtain complete modality features that contain more discriminative information. We conduct extensive experiments on three VIReID datasets and compare the performance of our method with other existing approaches to demonstrate its effectiveness and superiority.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2112-2126"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Uncertainty Guided Progressive Few-Shot Learning Perception for Aerial View Synthesis
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-25 · DOI: 10.1109/TMM.2024.3521727
Zihan Gao;Lingling Li;Xu Liu;Licheng Jiao;Fang Liu;Shuyuan Yang
{"title":"Uncertainty Guided Progressive Few-Shot Learning Perception for Aerial View Synthesis","authors":"Zihan Gao;Lingling Li;Xu Liu;Licheng Jiao;Fang Liu;Shuyuan Yang","doi":"10.1109/TMM.2024.3521727","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521727","url":null,"abstract":"View synthesis of aerial scenes has gained attention in the recent development of applications such as urban planning, navigation, and disaster assessment. This development is closely connected to the recent advancement of the Neural Radiance Field (NeRF). However, when autonomousaerial vehicles(AAVs) encounter constraints such as limited perspectives or energy limitations, NeRF degrades with sparsely sampled views in complex aerial scenes. On this basis, we aim to solve this problem in a few-shot manner. In this paper, we propose Uncertainty Guided Perception NeRF (UPNeRF), an uncertainty-guided perceptual learning framework that focuses on applying and improving NeRF in few-shot aerial view synthesis (FSAVS). First, simply optimizing NeRF in complex aerial scenes with sparse input can lead to overfitting in training views, resulting in a collapsed model. To address this, we propose a progressive learning strategy that utilizes the uncertainty present in sparsely sampled views, enabling a gradual transition from easy to hard learning. Second, to take advantage of the inherent inductive bias in the data, we introduce an uncertainty-aware discriminator. This discriminator leverages convolutional capabilities to capture intricate patterns in the rendered patches associated with uncertainty. Third, direct optimization of NeRF lacks prior knowledge of the scene. This, coupled with a reduction in training views, can result in unrealistic rendering. To overcome this, we present a perceptual regularizer that incorporates prior knowledge through prompt tuning of a self-supervised pre-trained vision transformer. In addition, we adopt a sampled scene annealing strategy to enhance training stability. Finally, we conducted experiments with two public datasets, and the positive results indicate our method is effective.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1177-1192"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Symmetric Hallucination With Knowledge Transfer for Few-Shot Learning
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521802
Shuo Wang;Xinyu Zhang;Meng Wang;Xiangnan He
{"title":"Symmetric Hallucination With Knowledge Transfer for Few-Shot Learning","authors":"Shuo Wang;Xinyu Zhang;Meng Wang;Xiangnan He","doi":"10.1109/TMM.2024.3521802","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521802","url":null,"abstract":"Data hallucination or augmentation is a straightforward solution for few-shot learning (FSL), where FSL is proposed to classify a novel object under limited training samples. Common hallucination strategies use visual or textual knowledge to simulate the distribution of a given novel category and generate more samples for training. However, the diversity and capacity of generated samples through these techniques can be insufficient when the knowledge domain of the novel category is narrow. Therefore, the performance improvement of the classifier is limited. To address this issue, we propose a Symmetric data hallucination strategy with Knowledge Transfer (SHKT) that interacts with multi-modal knowledge in both visual and textual spaces. Specifically, we first calculate the relations based on semantic knowledge and select the most related categories of a given novel category for hallucination. Second, we design two parameter-free data hallucination strategies to enrich the training samples by mixing the given and selected samples in both visual and textual spaces. The generated visual and textual samples improve the visual representation and enrich the textual supervision, respectively. Finally, we connect the visual and textual knowledge through transfer calculation, which not only exchanges content from different modalities but also constrains the distribution of the generated samples during the training. We apply our method to four benchmark datasets and achieve state-of-the-art performance in all experiments. Specifically, compared to the baseline on the Mini-ImageNet dataset, it achieves 12.84% and 3.46% accuracy improvements for 1 and 5 support training samples, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1797-1807"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An Active Multi-Target Domain Adaptation Strategy: Progressive Class Prototype Rectification
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521740
Yanan Zhu;Jiaqiu Ai;Le Wu;Dan Guo;Wei Jia;Richang Hong
{"title":"An Active Multi-Target Domain Adaptation Strategy: Progressive Class Prototype Rectification","authors":"Yanan Zhu;Jiaqiu Ai;Le Wu;Dan Guo;Wei Jia;Richang Hong","doi":"10.1109/TMM.2024.3521740","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521740","url":null,"abstract":"Compared to single-source to single-target (1S1T) domain adaptation, single-source to multi-target (1SmT) domain adaptation is more practical but also more challenging. In 1SmT scenarios, the significant differences in feature distributions between various target domains increase the difficulty for models to adapt to multiple domains. Moreover, 1SmT requires effective transfer to each target domain while maintaining performance in the source domain, demanding higher generalization capabilities from the model. In 1S1T scenarios, active domain adaptation methods improve generalization by incorporating a few target domain samples, but these methods are rarely applied in 1SmT due to potential sampling bias and outlier interference. To address this, we propose Progressive Prototype Refinement (PPR), an active multi-target domain adaptation method combining 1SmT with active learning to enhance cross-domain knowledge transfer. Specifically, an uncertainty assessment strategy is used to select representative samples from multiple target domains, forming a candidate set for model training. Based on the Lindeberg--Levy central limit theorem, we sample from a Gaussian distribution using corrected prototype statistics to augment the classifier's feature input, allowing the model to learn transitional information between domains. Finally, a mapping matrix is used for cross-domain alignment, addressing incomplete class coverage and outlier interference. Extensive experiments on multiple benchmark datasets demonstrate PPR's superior performance, with a 6.35% improvement on the PACS dataset and a 17.32% improvement on the Remote Sensing dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1874-1886"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Unsupervised Low-Light Image Enhancement With Self-Paced Learning
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521752
Yu Luo;Xuanrong Chen;Jie Ling;Chao Huang;Wei Zhou;Guanghui Yue
{"title":"Unsupervised Low-Light Image Enhancement With Self-Paced Learning","authors":"Yu Luo;Xuanrong Chen;Jie Ling;Chao Huang;Wei Zhou;Guanghui Yue","doi":"10.1109/TMM.2024.3521752","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521752","url":null,"abstract":"Low-light image enhancement (LIE) aims to restore images taken under poor lighting conditions, thereby extracting more information and details to robustly support subsequent visual tasks. While past deep learning (DL)-based techniques have achieved certain restoration effects, these existing methods treat all samples equally, ignoring the fact that difficult samples may be detrimental to the network's convergence at the initial training stages of network training. In this paper, we introduce a self-paced learning (SPL)-based LIE method named SPNet, which consists of three key components: the feature extraction module (FEM), the low-light image decomposition module (LIDM), and a pre-trained denoise module. Specifically, for a given low-light image, we first input the image, its pseudo-reference image, and its histogram-equalized version into the FEM to obtain preliminary features. Second, to avoid ambiguities during the early stages of training, these features are then adaptively fused via an SPL strategy and processed for retinex decomposition via LIDM. Third, we enhance the network performance by constraining the gradient prior relationship between the illumination components of the images. Finally, a pre-trained denoise module reduces noise inherent in LIE. Extensive experiments on nine public datasets reveal that the proposed SPNet outperforms eight state-of-the-art DL-based methods in both qualitative and quantitative evaluations and outperforms three conventional methods in quantitative assessments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1808-1820"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521719
Guanglin Zhou;Zhongyi Han;Shiming Chen;Biwei Huang;Liming Zhu;Tongliang Liu;Lina Yao;Kun Zhang
{"title":"HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization","authors":"Guanglin Zhou;Zhongyi Han;Shiming Chen;Biwei Huang;Liming Zhu;Tongliang Liu;Lina Yao;Kun Zhang","doi":"10.1109/TMM.2024.3521719","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521719","url":null,"abstract":"Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features. In DG, the prevalent practice of constraining models to a fixed structure or uniform parameterization to encapsulate invariant features can inadvertently blend specific aspects. Such an approach struggles with nuanced differentiation of inter-domain variations and may exhibit bias towards certain domains, hindering the precise learning of domain-invariant features. Recognizing this, we introduce a novel method designed to supplement the model with domain-level and task-specific characteristics. This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization. Building on the emerging trend of visual prompts in the DG paradigm, our work introduces the novel <bold>H</b>ierarchical <bold>C</b>ontrastive <bold>V</b>isual <bold>P</b>rompt (HCVP) methodology. This represents a significant advancement in the field, setting itself apart with a unique generative approach to prompts, alongside an explicit model structure and specialized loss functions. Differing from traditional visual prompts that are often shared across entire datasets, HCVP utilizes a hierarchical prompt generation network enhanced by prompt contrastive learning. These generative prompts are instance-dependent, catering to the unique characteristics inherent to different domains and tasks. Additionally, we devise a prompt modulation network that serves as a bridge, effectively incorporating the generated visual prompts into the vision transformer backbone. Experiments conducted on five DG datasets demonstrate the effectiveness of HCVP, outperforming both established DG algorithms and adaptation protocols.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1142-1152"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Progressive Knowledge Distillation From Different Levels of Teachers for Online Action Detection
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521772
Md Moniruzzaman;Zhaozheng Yin
{"title":"Progressive Knowledge Distillation From Different Levels of Teachers for Online Action Detection","authors":"Md Moniruzzaman;Zhaozheng Yin","doi":"10.1109/TMM.2024.3521772","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521772","url":null,"abstract":"In this paper, we explore the problem of Online Action Detection (OAD), where the task is to detect ongoing actions from streaming videos without access to video frames in the future. Existing methods achieve good detection performance by capturing long-range temporal structures. However, a major challenge of this task is to detect actions at a specific time that arrive with insufficient observations. In this work, we utilize the additional future frames available at the training phase and propose a novel Knowledge Distillation (KD) framework for OAD, where a teacher network looks at more frames from the future and the student network distills the knowledge from the teacher for detecting ongoing actions from the observation up to the current frames. Usually, the conventional KD regards a high-level teacher network (i.e., the network after the last training iteration) to guide the student network throughout all training iterations, which may result in poor distillation due to the large knowledge gap between the high-level teacher and the student network at early training iterations. To remedy this, we propose a novel progressive knowledge distillation from different levels of teachers (PKD-DLT) for OAD, where in addition to a high-level teacher, we also generate several low- and middle-level teachers, and progressively transfer the knowledge (in the order of low- to high-level) to the student network throughout training iterations, for effective distillation. Evaluated on two challenging datasets THUMOS14 and TVSeries, we validate that our PKD-DLT is an effective teacher-student learning paradigm, which can be a plug-in to improve the performance of the existing OAD models and achieve a state-of-the-art.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1526-1537"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking
IF 8.4 · CAS Q1 · Computer Science
IEEE Transactions on Multimedia · Pub Date: 2024-12-24 · DOI: 10.1109/TMM.2024.3521842
Jinpu Zhang;Ziwen Li;Ruonan Wei;Yuehuan Wang
{"title":"Augment One With Others: Generalizing to Unforeseen Variations for Visual Tracking","authors":"Jinpu Zhang;Ziwen Li;Ruonan Wei;Yuehuan Wang","doi":"10.1109/TMM.2024.3521842","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521842","url":null,"abstract":"Unforeseen appearance variation is a challenging factor for visual tracking. This paper provides a novel solution from semantic data augmentation, which facilitates offline training of trackers for better generalization. We utilize existing samples to obtain knowledge to augment another in terms of diversity and hardness. First, we propose that the similarity matching space in Siamese-like models has class-agnostic transferability. Based on this, we design the Latent Augmentation (LaAug) to transfer relevant variations and suppress irrelevant ones between training similarity embeddings of different classes. Thus the model can generalize across a more diverse semantic distribution. Then, we propose the Semantic Interaction Mix (SIMix), which interacts moments between different feature samples to contaminate structure and texture attributes and retain other semantic attributes. SIMix simulates the occlusion and complements the training distribution with hard cases. The mixed features with adversarial perturbations can empirically enable the model against external environmental disturbances. Experiments on six challenging benchmarks demonstrate that three representative tracking models, i.e., SiamBAN, TransT and OSTrack, can be consistently improved by incorporating the proposed methods without extra parameters and inference cost.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1461-1474"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0