IEEE Transactions on Circuits and Systems for Video Technology: Latest Articles

Cross-Domain Animal Pose Estimation With Skeleton Anomaly-Aware Learning
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557844
Le Han; Kaixuan Chen; Lei Zhao; Yangbo Jiang; Pengfei Wang; Nenggan Zheng
{"title":"Cross-Domain Animal Pose Estimation With Skeleton Anomaly-Aware Learning","authors":"Le Han;Kaixuan Chen;Lei Zhao;Yangbo Jiang;Pengfei Wang;Nenggan Zheng","doi":"10.1109/TCSVT.2025.3557844","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557844","url":null,"abstract":"Animal pose estimation is often constrained by the scarcity of annotations and the diversity of scenarios and species. The pseudo-label generation based unsupervised domain adaptation paradigm, which discriminates the predicted keypoints of unlabeled data based on the skeleton position consistency, has demonstrated effectiveness for such problems. However, existing methods generate pseudo-labels with massive false positives, because they cannot effectively distinguish sample pairs with the same errors. In this study, we propose a cross-domain animal pose estimation model from a novel perspective of skeleton anomaly learning. We construct a graph contrastive learning mechanism to acquire the skeleton anomaly-aware knowledge, which enables the generation of accurate pseudo-labels for target domain and imposes graph constraint on unlabeled data. And a skeleton anomaly-feedback based domain adaptation framework is designed to facilitate implicit alignment of object-specific features and joint training of cross-domain. Besides, we propose a novel rat pose dataset named UDARP-9.4K to address the gap of small-sized animal pose datasets encompassing diverse experimental scenarios. The related datasets are reviewed and evaluated in detail. Extensive experiments are conducted on UDARP-9.4K and two public datasets to demonstrate the superiority of the proposed model in cross-scenarios and cross-species animal pose estimation tasks. Further analysis reveals the effectiveness of the proposed model for skeleton structure feature learning. <italic>The UDARP-9.4K dataset is available here</i> <uri>https://github.com/CSDLLab/UDARP-9.4K-Dataset</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9148-9160"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
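The core of the method above is a graph contrastive mechanism that makes skeleton embeddings sensitive to anomalous joint configurations. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation: a tiny graph encoder embeds skeletons, and an InfoNCE loss treats two light augmentations of a clean skeleton as positives and heavily perturbed ("anomalous") skeletons as extra negatives. The encoder, the perturbation scheme, and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def skeleton_embed(keypoints, adj, w1, w2):
    """Tiny two-layer graph encoder. keypoints: (B, J, 2); adj: (J, J)."""
    h = torch.relu(adj @ keypoints @ w1)        # propagate over joints, project 2 -> H
    h = adj @ h @ w2                            # second propagation, H -> D
    return F.normalize(h.mean(dim=1), dim=-1)   # mean-pool joints -> (B, D)

def anomaly_contrastive_loss(view1, view2, anomalous, adj, w1, w2, tau=0.1):
    """view1/view2: two augmentations of the same clean skeletons (B, J, 2);
    anomalous: skeletons with perturbed joints, used as extra negatives."""
    z1 = skeleton_embed(view1, adj, w1, w2)
    z2 = skeleton_embed(view2, adj, w1, w2)
    zn = skeleton_embed(anomalous, adj, w1, w2)
    bank = torch.cat([z2, zn], dim=0)           # (2B, D) candidate matches
    logits = z1 @ bank.T / tau                  # (B, 2B) similarities
    labels = torch.arange(z1.size(0))           # positive of view1[i] is view2[i]
    return F.cross_entropy(logits, labels)

# Toy usage: batch of 8 skeletons with 17 joints each.
B, J, H, D = 8, 17, 32, 64
adj = torch.eye(J)                              # stand-in adjacency matrix
w1, w2 = torch.randn(2, H), torch.randn(H, D)
clean = torch.randn(B, J, 2)
loss = anomaly_contrastive_loss(clean + 0.01 * torch.randn_like(clean),
                                clean + 0.01 * torch.randn_like(clean),
                                clean + torch.randn_like(clean),  # heavy jitter = "anomalous"
                                adj, w1, w2)
```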
MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557992
Simiao Lai; Chang Liu; Jiawen Zhu; Ben Kang; Yang Liu; Dong Wang; Huchuan Lu
{"title":"MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking","authors":"Simiao Lai;Chang Liu;Jiawen Zhu;Ben Kang;Yang Liu;Dong Wang;Huchuan Lu","doi":"10.1109/TCSVT.2025.3557992","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557992","url":null,"abstract":"Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (<bold>MambaVT</b>) to fully exploit spatio-temporal contextual modeling for robust <bold>v</b>isible-<bold>t</b>hermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict the subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9312-9323"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
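MambaVT's linear-complexity claim rests on state-space sequence modeling. As a rough illustration of why an SSM scan is O(T) in sequence length, here is a generic diagonal state-space recurrence in PyTorch; it is not the selective-scan kernel used by Mamba or by this paper, and all shapes and parameter names are assumptions.

```python
import torch

def ssm_scan(x, A, B, C):
    """x: (T, D) input tokens; A: (N,) diagonal decay; B: (N, D); C: (D, N).
    Runs h_t = A * h_{t-1} + B x_t ; y_t = C h_t in O(T) time with O(N) state."""
    T = x.shape[0]
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A * h + B @ x[t]        # state update: elementwise decay + input drive
        ys.append(C @ h)            # readout back to the feature dimension
    return torch.stack(ys)          # (T, D)

# Toy usage: 100 frames of 64-d fused RGB-T tokens, 16-d hidden state.
x = torch.randn(100, 64)
A = torch.rand(16) * 0.9            # stable decay factors in (0, 0.9)
B = torch.randn(16, 64) * 0.1
C = torch.randn(64, 16) * 0.1
y = ssm_scan(x, A, B, C)            # (100, 64)
```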
DinoQuery: Promoting Small 3D Object Detection With Textual Prompt
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557950
Tong Ning; Ke Lu; Xirui Jiang; Hongjuan Pei; Jian Xue
{"title":"DinoQuery: Promoting Small 3D Object Detection With Textual Prompt","authors":"Tong Ning;Ke Lu;Xirui Jiang;Hongjuan Pei;Jian Xue","doi":"10.1109/TCSVT.2025.3557950","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557950","url":null,"abstract":"Query-based 3D object detection has gained significant success in the application of autonomous driving due to its ability to achieve good performance while maintaining low computational cost. However, it still struggles with the reliable detection of small objects such as bicycles and pedestrians. To address this challenge, this paper introduces a novel sparse query-based approach, termed DinoQuery. This approach utilizes Grounding-DINO with textual prompts to select small-sized objects and generate 2D category-aware queries. These 2D category-aware queries combined with 2D global queries are then lifted to 3D queries by associating each sampled query with its respective 3D position, orientation, and size. The validity of these 3D queries, along with the 2D queries, is verified by the Comprehensive Contrastive Learning (CCL) mechanism. This is achieved by aligning all 2D and 3D queries with their respective 2D and 3D ground truth labels, and computing similarity to select true positive and false positive queries. Then a contrastive loss is introduced to enhance true positive queries and weaken false positive ones based on geometric and semantic similarity. The DinoQuery was tested on the nuScenes dataset and demonstrated excellent performance. Notably, the largest increase of our method is 3.2% on NDS and 3.1% on mAP.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8639-8652"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
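The CCL mechanism labels queries as true or false positives by similarity to ground truth and then applies a contrastive objective to them. Below is a hedged sketch of that pattern; the threshold, the margin, and the shared embedding space are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn.functional as F

def query_contrastive_loss(queries, gt_embed, tp_thresh=0.5, margin=0.2):
    """queries: (Q, D) decoder queries; gt_embed: (G, D) ground-truth embeddings."""
    q = F.normalize(queries, dim=-1)
    g = F.normalize(gt_embed, dim=-1)
    best, _ = (q @ g.T).max(dim=1)              # each query's best similarity to any GT
    tp_mask = best > tp_thresh                  # true positives: close enough to some GT
    zero = best.new_zeros(())
    loss_tp = (1.0 - best[tp_mask]).mean() if tp_mask.any() else zero          # pull TPs closer
    loss_fp = F.relu(best[~tp_mask] - margin).mean() if (~tp_mask).any() else zero  # push FPs away
    return loss_tp + loss_fp

# Toy usage: 50 queries, 5 ground-truth objects, 256-d embeddings.
loss = query_contrastive_loss(torch.randn(50, 256), torch.randn(5, 256))
```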
Fully Semantic Gap Recovery for End-to-End Image Captioning
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3558088
Jingchun Gao; Lei Zhang; Jingyu Li; Zhendong Mao
{"title":"Fully Semantic Gap Recovery for End-to-End Image Captioning","authors":"Jingchun Gao;Lei Zhang;Jingyu Li;Zhendong Mao","doi":"10.1109/TCSVT.2025.3558088","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558088","url":null,"abstract":"Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we uncover this paradigm suffers from the inherent intra-modal semantic gap from the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, confine to the global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to broaden the robust cross-modal bridge of CLIP into a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model has achieved new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. Source code released at <uri>https://github.com/gjc0824/FSGR</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9365-9383"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
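The "local contrastive learning" step aggregates semantically similar patches. Below is a minimal patch-level InfoNCE sketch under the assumption that a positive patch index is already available for each anchor (e.g., from a frozen encoder); FSGR's actual grouping rule is more involved, so this is illustrative only.

```python
import torch
import torch.nn.functional as F

def local_patch_contrastive(patches, pos_index, tau=0.1):
    """patches: (P, D) patch features from one image; pos_index: (P,) index of an
    assumed positive patch for each anchor."""
    z = F.normalize(patches, dim=-1)
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = (z @ z.T / tau).masked_fill(eye, float('-inf'))  # exclude self-matches
    return F.cross_entropy(logits, pos_index)

# Toy usage: 196 patches of dim 512; a dummy cyclic assignment stands in for real positives.
patches = torch.randn(196, 512)
pos_index = (torch.arange(196) + 1) % 196
loss = local_patch_contrastive(patches, pos_index)
```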
fMRI2GES: Co-Speech Gesture Reconstruction From fMRI Signal With Dual Brain Decoding Alignment
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3558125
Chunzheng Zhu; Jialin Shao; Jianxin Lin; Yijun Wang; Jing Wang; Jinhui Tang; Kenli Li
{"title":"fMRI2GES: Co-Speech Gesture Reconstruction From fMRI Signal With Dual Brain Decoding Alignment","authors":"Chunzheng Zhu;Jialin Shao;Jianxin Lin;Yijun Wang;Jing Wang;Jinhui Tang;Kenli Li","doi":"10.1109/TCSVT.2025.3558125","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558125","url":null,"abstract":"Understanding how the brain responds to external stimuli and decoding this process has been a significant challenge in neuroscience. While previous studies typically concentrated on brain-to-image and brain-to-language reconstruction, our work strives to reconstruct gestures associated with speech stimuli perceived by brain. Unfortunately, the lack of paired {brain, speech, gesture} data hinders the deployment of deep learning models for this purpose. In this paper, we introduce a novel approach, fMRI2GES, that allows training of fMRI-to-gesture reconstruction networks on unpaired data using Dual Brain Decoding Alignment. This method relies on two key components: 1) observed texts that elicit brain responses, and 2) textual descriptions associated with the gestures. Then, instead of training models in a completely supervised manner to find a mapping relationship among the three modalities, we harness an fMRI-to-text model, a text-to-gesture model with paired data and an fMRI-to-gesture model with unpaired data, establishing dual fMRI-to-gesture reconstruction patterns. Afterward, we explicitly align two outputs and train our model in a self-supervision way. We show that our proposed method can reconstruct expressive gestures directly from fMRI recordings. We also investigate fMRI signals from different ROIs in the cortex and how they affect generation results. Overall, we provide new insights into decoding co-speech gestures, thereby advancing our understanding of neuroscience and cognitive science.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9017-9029"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
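The dual brain decoding alignment idea can be illustrated with two reconstruction paths whose outputs are aligned by a simple regression loss: an indirect fMRI-to-text-to-gesture path built from models trained on paired data, and a direct fMRI-to-gesture path trained on unpaired data. The sketch below uses linear layers as stand-ins for all three models; every module name and dimension is a placeholder, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DualDecodingAlignment(nn.Module):
    def __init__(self, fmri_dim=4096, text_dim=512, gesture_dim=128):
        super().__init__()
        self.fmri_to_text = nn.Linear(fmri_dim, text_dim)        # stand-in decoders
        self.text_to_gesture = nn.Linear(text_dim, gesture_dim)
        self.fmri_to_gesture = nn.Linear(fmri_dim, gesture_dim)

    def forward(self, fmri):
        g_via_text = self.text_to_gesture(self.fmri_to_text(fmri))  # indirect path
        g_direct = self.fmri_to_gesture(fmri)                        # direct path
        # self-supervised alignment: the direct path regresses onto the indirect one
        align_loss = nn.functional.mse_loss(g_direct, g_via_text.detach())
        return g_direct, align_loss

# Toy usage: a batch of 8 fMRI vectors.
model = DualDecodingAlignment()
gesture, loss = model(torch.randn(8, 4096))
```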
Meta-Learning With Task-Adaptive Selection
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-03 DOI: 10.1109/TCSVT.2025.3557706
Quan Wan; Maofa Wang; Weifeng Shan; Bin Wang; Lu Zhang; Zhixiong Leng; Bingchen Yan; Yanlin Xu; Huiling Chen
{"title":"Meta-Learning With Task-Adaptive Selection","authors":"Quan Wan;Maofa Wang;Weifeng Shan;Bin Wang;Lu Zhang;Zhixiong Leng;Bingchen Yan;Yanlin Xu;Huiling Chen","doi":"10.1109/TCSVT.2025.3557706","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557706","url":null,"abstract":"The gradient-based meta-learning algorithm gains meta-learning parameters from a pool of tasks. Starting from the obtained meta-learning parameters, it can achieve better results through fast fine-tuning with only a few gradient descent updates. The two-layer meta-learning approach that shares initialization parameters has achieved good results in solving few-shot learning domain. However, in the training of multiple similar tasks in the inner layer, the difficulty and benefits of the tasks have been consistently overlooked, resulting in conflicts between tasks and ultimately compromising the model to unexpected positions. Therefore, this paper proposes a task-adaptive selection meta-learning algorithm called TSML. Specifically, we construct a task selection trainer to assess the difficulty of tasks and calculate their future benefits. Designing more optimal training strategies for each task based on difficulty and benefit, altering the current compromise in multi-task settings, and balancing the impact of tasks on meta-learning parameters. Additionally, the outer meta-parameter updating method for traditional meta-learning has been adjusted, enabling the meta-parameters to attain a better position. By doing so, we can rapidly improve the generalization and convergence of the meta-learning parameters on unknown tasks. Experimental results indicate a 2.1% improvement over the base model in the 4-conv setting, with a more pronounced effect as the neural network is progressively complexified, reaching a 4.1% improvement in resnet12.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8627-8638"},"PeriodicalIF":11.1,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
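TSML weights inner-loop tasks by difficulty and expected benefit before the outer update. The sketch below shows the general shape of such a weighted MAML-style step on a deliberately tiny linear model; the softmax-over-support-loss weighting rule is an illustrative stand-in for the paper's task-selection trainer, not its actual criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_maml_step(model, tasks, meta_opt, inner_lr=0.01):
    """model: a single nn.Linear (kept simple so the inner update can be written
    functionally); tasks: list of (x_s, y_s, x_q, y_q) support/query tensors."""
    support_losses, query_losses = [], []
    for x_s, y_s, x_q, y_q in tasks:
        s_loss = F.mse_loss(model(x_s), y_s)
        w, b = model.weight, model.bias
        gw, gb = torch.autograd.grad(s_loss, (w, b), create_graph=True)
        q_pred = F.linear(x_q, w - inner_lr * gw, b - inner_lr * gb)  # one adapted step
        query_losses.append(F.mse_loss(q_pred, y_q))
        support_losses.append(s_loss.detach())
    # toy rule: harder tasks (higher support loss) get more weight in the outer update
    weights = torch.softmax(torch.stack(support_losses), dim=0)
    meta_loss = (weights * torch.stack(query_losses)).sum()
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()

# Toy usage: 4 regression tasks with 5-d inputs.
model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
tasks = [(torch.randn(10, 5), torch.randn(10, 1),
          torch.randn(10, 5), torch.randn(10, 1)) for _ in range(4)]
weighted_maml_step(model, tasks, opt)
```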
FRPGS: Fast, Robust, and Photorealistic Monocular Dynamic Scene Reconstruction With Deformable 3D Gaussians
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557012
Wan Li; Xiao Pan; Jiaxin Lin; Ping Lu; Daquan Feng; Wenzhe Shi
{"title":"FRPGS: Fast, Robust, and Photorealistic Monocular Dynamic Scene Reconstruction With Deformable 3D Gaussians","authors":"Wan Li;Xiao Pan;Jiaxin Lin;Ping Lu;Daquan Feng;Wenzhe Shi","doi":"10.1109/TCSVT.2025.3557012","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557012","url":null,"abstract":"Dynamic reconstruction technology presents significant promise for applications in visual and interactive fields. Current techniques utilizing 3D Gaussian Splatting show favorable results and fast reconstruction speed. However, as scene expanding, using individual Gaussian structure 1) leads to instability in large-scale dynamic reconstruction, marked by abrupt deformation, and 2) the heuristic densification of individuals suffers significant redundancy. Tackling these issues, we propose a jointed Gaussian representation method named FRPGS, which learns the global information and the deformation using center Gaussians and generates the neural Gaussians around them for local detail. Specifically, FRPGS employs center Gaussians initialized from point clouds, which are learned with a deformation field for representing global relationships and dynamic motion over time. Then, for each center Gaussian, attribute networks generate neural Gaussians that move under the linked center Gaussian driving, thereby ensuring structural integrity during movement within this joint-based representation. Finally, to reduce Gaussian redundancy, a densification strategy is developed based on the average cumulative gradient of the associated neural Gaussians, imposing strict limits on the growing of center Gaussians without compromising accuracy. Additionally, we established a large-scale dynamic indoor dataset at the MuLong Laboratory of ZTE Corporation. Evaluations demonstrate that FRPGS significantly outperforms state-of-the-art methods in both training efficiency and reconstruction quality, achieving over a 50% (up to 74%) improvement in efficiency on an RTX 4090. FRPGS also supports the 4K resolution reconstruction of 60 frames simultaneously.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9119-9131"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
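The densification strategy keys on the average cumulative gradient of the neural Gaussians attached to each center Gaussian, with a strict cap on growth. A toy version of that bookkeeping is sketched below; the threshold, jitter, cap, and clone rule are all assumptions rather than the paper's actual procedure.

```python
import torch

def densify_centers(centers, accum_grad, num_steps, grad_thresh=2e-4, max_centers=100_000):
    """centers: (M, 3) center-Gaussian positions; accum_grad: (M,) summed gradient
    magnitudes of the associated neural Gaussians over num_steps iterations."""
    avg_grad = accum_grad / max(num_steps, 1)
    grow_mask = avg_grad > grad_thresh
    budget = max(max_centers - centers.shape[0], 0)      # strict limit on growth
    if budget == 0 or not grow_mask.any():
        return centers
    idx = torch.nonzero(grow_mask).squeeze(1)
    idx = idx[avg_grad[idx].argsort(descending=True)][:budget]   # highest-gradient first
    new_centers = centers[idx] + 0.01 * torch.randn_like(centers[idx])  # jittered clones
    return torch.cat([centers, new_centers], dim=0)

# Toy usage: 1000 centers with random accumulated gradients over 100 iterations.
centers = torch.rand(1000, 3)
accum_grad = torch.rand(1000) * 0.1
centers = densify_centers(centers, accum_grad, num_steps=100)
```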
On Modulating Motion-Aware Visual-Language Representation for Few-Shot Action Recognition
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557009
Pengfei Fang; Qiang Xu; Zixuan Lin; Hui Xue
{"title":"On Modulating Motion-Aware Visual-Language Representation for Few-Shot Action Recognition","authors":"Pengfei Fang;Qiang Xu;Zixuan Lin;Hui Xue","doi":"10.1109/TCSVT.2025.3557009","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557009","url":null,"abstract":"This paper focuses on few-shot action recognition (FSAR), where the machine is required to understand human actions, with each only seeing a few video samples. Even with only a few explorations, the most cutting-edge methods employ the action textual features, pre-trained by a visual-language model (VLM), as a cue to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook rich motion cues within videos. To tackle this issue, we propose a novel framework, namely, <underline>mo</u>tion-aware <underline>v</u>isual-language r<underline>e</u>presentation modulation <underline>net</u>work (MoveNet). The proposed MoveNet utilizes dynamic motion cues within videos to integrate motion-aware textual and visual feature representations, as a way to modulate the video prototypes. In doing so, a long short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. Having the motion cues at hand, a motion-conditional prompting module (MCPM) utilizes the motion cues as conditions to boost the semantic associations between textual features and action classes. One further develops a motion-guided visual refinement module (MVRM) that adopts motion cues as guidance in enhancing local frame features. The proposed components compensate for each other and contribute to significant performance gains over the FASR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, considerably outperforming current state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8614-8626"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
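The motion-conditional prompting module (MCPM) conditions learnable prompt tokens on motion cues. The snippet below sketches one plausible realization using FiLM-style scale-and-shift modulation of a prompt bank; it is an assumed stand-in, not the paper's module.

```python
import torch
import torch.nn as nn

class MotionConditionalPrompt(nn.Module):
    def __init__(self, n_tokens=8, dim=512, motion_dim=256):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)  # learnable context tokens
        self.to_scale = nn.Linear(motion_dim, dim)
        self.to_shift = nn.Linear(motion_dim, dim)

    def forward(self, motion_feat):
        """motion_feat: (B, motion_dim) aggregated motion cue -> (B, n_tokens, dim)."""
        scale = self.to_scale(motion_feat).unsqueeze(1)   # (B, 1, dim)
        shift = self.to_shift(motion_feat).unsqueeze(1)
        return self.prompt.unsqueeze(0) * (1 + scale) + shift

# Toy usage: prompts for a batch of 4 clips, ready to feed a frozen text encoder.
prompts = MotionConditionalPrompt()(torch.randn(4, 256))   # (4, 8, 512)
```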
Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557254
Yifei Xu; Zaiqiang Wu; Li Li; Siqi Li; Wenlong Li; Mingqi Li; Yuan Rao; Shuiguang Deng
{"title":"Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer","authors":"Yifei Xu;Zaiqiang Wu;Li Li;Siqi Li;Wenlong Li;Mingqi Li;Yuan Rao;Shuiguang Deng","doi":"10.1109/TCSVT.2025.3557254","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557254","url":null,"abstract":"Video summarization aims to seek the most important information from a source video while still retaining its primary content. In practical application, unsupervised video summarizers are acknowledged for their flexibility and superiority without requiring annotated data. However, they are looking for the determined rules on how much each frame is essential enough to be selected as a summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer termed Hybrid Siamese Masked Autoencoders (H-SMAE) from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and Shot Diversity Enhancer (SDE). MV-SMAE tries to recover the masked shots from original frame feature and three unmasked shot subsets with elaborate Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture to model prior references to guide the reconstruction of masked shots. Besides, SDE improves the diversity of generated summary by minimizing the repelling loss among selected shots. Afterward, these two modules are fused followed by 0-1 knapsack algorithm to produce a video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly-supervised methods, and even generates comparable results with several excellent supervised methods. The source code of H-SMAE is available at <uri>https://github.com/wzq0214/H-SMAE</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9487-9501"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
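The final summary assembly uses a 0-1 knapsack over shot scores and lengths, a standard step in shot-level summarization: pick the subset of shots that maximizes total importance within a duration budget (commonly around 15% of the video). A self-contained dynamic-programming version follows; the budget and toy numbers are illustrative.

```python
def knapsack_summary(scores, lengths, budget):
    """scores, lengths: per-shot lists; budget: maximum total length (e.g., in frames)."""
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]                       # skip shot i-1
            if lengths[i - 1] <= c:                       # or take it if it fits
                dp[i][c] = max(dp[i][c], dp[i - 1][c - lengths[i - 1]] + scores[i - 1])
    # backtrack to recover the selected shot indices
    selected, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)

# Toy usage: 6 shots, keep at most 150 frames of a 1000-frame video (15%).
print(knapsack_summary([0.9, 0.2, 0.7, 0.4, 0.8, 0.1], [60, 40, 50, 30, 70, 20], 150))
```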
Learning Language Prompt for Vision-Language Tracking
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557053
Chengao Zong; Jie Zhao; Xin Chen; Huchuan Lu; Dong Wang
{"title":"Learning Language Prompt for Vision-Language Tracking","authors":"Chengao Zong;Jie Zhao;Xin Chen;Huchuan Lu;Dong Wang","doi":"10.1109/TCSVT.2025.3557053","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557053","url":null,"abstract":"Vision-language object tracking integrates advanced linguistic information, enhancing its robustness and accuracy in complex scenarios. Nevertheless, current methods are constrained by a lack of sufficient vision-language data, making it challenging for the model to learn generalized knowledge. To alleviate this issue, we propose a new prompt-based framework for vision-language tracking, named ProVLT. This framework casts language information as a prompt for pretrained vision-based tracking models, thereby leveraging the knowledge from extensive tracking data. Experiments demonstrate that ProVLT achieves competitive performance while training only a fraction of parameters (approximately 29% of modal parameters). For instance, ProVLT achieves competitive performance, attaining AUC of 59.8% on TNL2K benchmark. Furthermore, we augment five mainstream vision-only tracking benchmarks with language annotations, and find that the inclusion of linguistic information consistently improves tracking performance. On these benchmarks, the linguistic information improves the performance by an average of 2.9% compared with the vision-based tracker. We will release the code, models, and benchmarks for the community.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9287-9299"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
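ProVLT trains only a language-prompt branch on top of a frozen pretrained vision tracker, which is why only a fraction of the parameters receive gradients. The sketch below shows that freezing pattern with placeholder modules; it is not the ProVLT architecture, and the tiny stand-in backbone is purely for the demo.

```python
import torch.nn as nn

class PromptedTracker(nn.Module):
    def __init__(self, vision_tracker: nn.Module, text_dim=512, feat_dim=256):
        super().__init__()
        self.vision_tracker = vision_tracker
        self.language_prompt = nn.Sequential(              # trainable prompt branch
            nn.Linear(text_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        for p in self.vision_tracker.parameters():         # freeze pretrained weights
            p.requires_grad_(False)

    def trainable_parameters(self):
        return [p for p in self.parameters() if p.requires_grad]

# Toy usage: count how much of the model is actually trained.
tracker = PromptedTracker(vision_tracker=nn.Linear(256, 256))   # stand-in backbone
n_train = sum(p.numel() for p in tracker.trainable_parameters())
n_total = sum(p.numel() for p in tracker.parameters())
print(f"training {n_train / n_total:.1%} of parameters")
```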