IEEE Transactions on Multimedia最新文献_第2页

A Cross-Modal Generation Algorithm for Temporal Force Tactile Data for Multidimensional Haptic Rendering 面向多维触觉渲染的时间力触觉数据跨模态生成算法

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590907

Rui Song;Guohong Liu;Yan Zhang;Xiaoying Sun

{"title":"A Cross-Modal Generation Algorithm for Temporal Force Tactile Data for Multidimensional Haptic Rendering","authors":"Rui Song;Guohong Liu;Yan Zhang;Xiaoying Sun","doi":"10.1109/TMM.2025.3590907","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590907","url":null,"abstract":"Exploiting the correlation between multimodal data to generate tactile data has become a preferred approach to enhance tactile rendering fidelity. Nevertheless, existing studies have often overlooked the temporal dynamics of force tactile data. To fill this gap in the literature, this paper introduces a joint visual-audio approach to generate a temporal tactile data (VA2T) algorithm, focusing on the temporal and long-term dependencies of force tactile data. VA2T uses a feature extraction network to extract audio and image features and then uses an attention mechanism and decoder to fuse these features. The tactile reconstructor generates temporal friction and a normal force, with dilated causal convolution securing the temporal dependencies in the force tactile data. Simulation experiments on the LMT dataset demonstrate that compared with the transformer and audio-visual-aided haptic signal reconstruction (AVHR) algorithms, the VA2T algorithm reduces the RMSE for generated friction by 29.44% and 32.37%, respectively, and for normal forces by 23.30% and 35.43%, respectively. In addition, we developed a haptic rendering approach that combines electrovibration and mechanical vibration to render the generated friction and normal force. The subjective experimental results showed that the rendering fidelity of the data generated using the VA2T method was significantly higher than that of the data generated using the transformer and AVHR methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"5092-5102"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144914311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Spectral Discrepancy and Cross-Modal Semantic Consistency Learning for Object Detection in Hyperspectral Images 光谱差异和跨模态语义一致性学习在高光谱图像目标检测中的应用

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-11 DOI: 10.1109/TMM.2025.3586155

Xiao He;Chang Tang;Xinwang Liu;Wei Zhang;Zhimin Gao;Chuankun Li;Shaohua Qiu;Jiangfeng Xu

{"title":"Spectral Discrepancy and Cross-Modal Semantic Consistency Learning for Object Detection in Hyperspectral Images","authors":"Xiao He;Chang Tang;Xinwang Liu;Wei Zhang;Zhimin Gao;Chuankun Li;Shaohua Qiu;Jiangfeng Xu","doi":"10.1109/TMM.2025.3586155","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586155","url":null,"abstract":"Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences in hyperspectral inter-bands and unavoidable interferences, e.g., sensor noises and illumination. To alleviate the hyperspectral inter-bands inconsistencies and redundancy, we propose a novel network termed <bold>Spectral <bold>Discrepancy and <bold>Cross-<bold>Modal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. On the other hand, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance when compared with other ones.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6719-6731"},"PeriodicalIF":9.7,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Joint Distribution Weighted Alignment for Multi-Source Domain Adaptation via Kernel Relative Entropy Estimation 基于核相对熵估计的多源域自适应联合分布加权对齐

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-10 DOI: 10.1109/TMM.2025.3586109

Sentao Chen;Ping Xuan;Zhifeng Hao

{"title":"Joint Distribution Weighted Alignment for Multi-Source Domain Adaptation via Kernel Relative Entropy Estimation","authors":"Sentao Chen;Ping Xuan;Zhifeng Hao","doi":"10.1109/TMM.2025.3586109","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586109","url":null,"abstract":"The objective of Multi-Source Domain Adaptation (MSDA) is to train a neural network on labeled data from multiple joint source distributions (source domains) and unlabeled data from a joint target distribution (target domain), and use the trained network to estimate the target data labels. The challenge in this MSDA problem is that the multiple joint source distributions are relevant but distinct from the joint target distribution. To address this challenge, we propose a Joint Distribution Weighted Alignment (JDWA) approach to align a weighted joint source distribution to the joint target distribution under the relative entropy. Specifically, the weighted joint source distribution is defined as the weighted sum of the multiple joint source distributions, and is parameterized by the relevance weights. Since the relative entropy is unknown in practice, we propose a Kernel Relative Entropy Estimation (KREE) method to estimate it from data. Our KREE method first reformulates relative entropy as the negative of the minimal value of a functional, then exploits a function from the Reproducing Kernel Hilbert Space (RKHS) as the functional’s input, and finally solves the resultant convex problem with a global optimal solution. We also incorporate entropy regularization to enhance the network’s performance. Together, we minimize cross entropy, relative entropy, and entropy to learn both the relevance weights and the neural network. Experimental results on benchmark image classification datasets demonstrate that our JDWA approach performs better than the comparison methods. Intro video and Pytorch code are available at <uri>https://github.com/sentaochen/Joint-Distribution-Weighted-Alignment</uri>. Interested readers are also welcome to visit <uri>https://github.com/sentaochen</uri> for more source codes of the domain adaptation, partial domain adaptation, multi-source domain adaptation, and domain generalization approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6606-6619"},"PeriodicalIF":9.7,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images 基于区块链和改进感知哈希的纯彩色背景图像版权保护方案

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-10 DOI: 10.1109/TMM.2025.3586150

Guangyong Gao;Tongchao Feng;Chongtao Guo;Zhihua Xia;Yun-Qing Shi

{"title":"A Blockchain and Improved Perception Hash Based Copyright Protection Scheme for Purely Chromatic Background Images","authors":"Guangyong Gao;Tongchao Feng;Chongtao Guo;Zhihua Xia;Yun-Qing Shi","doi":"10.1109/TMM.2025.3586150","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586150","url":null,"abstract":"Purely chromatic background images are widely used in computer wallpapers and advertisements, leading to issues such as copyright infringement and the loss of interest of holders. Image hashing is a technique used for comparing the similarity between images, and is often used for image verification, search, and copy detection due to its insensitivity to subtle changes in the original image. In a purely chromatic background image, the central detail of the image is the primary part and the key for copyright authentication. As the perception hash (pHash) algorithm only retains the low-frequency portion of the discrete cosine transform (DCT) matrix, it is unsuitable for purely chromatic background images. To deal with this issue, we propose an improved perception hash (ipHash) algorithm to enhance the universality of the algorithm by extracting purely chromatic background image features. Meanwhile, the development of image hashing is restricted due to the requirement of a trusted third party. To solve this issue, a secure blockchain-based image copyright protection scheme is designed. It realizes the copyright authentication and traceability, and overcomes the issue of a lack of trusted third parties. Experimental results show that the proposed method outperforms the state-of-the-art image copyright protection schemes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6635-6647"},"PeriodicalIF":9.7,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

S4R: Rethinking Point Cloud Sampling via Guiding Upsampling-Aware Perception S4R：基于上采样感知引导的点云采样重新思考

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-10 DOI: 10.1109/TMM.2025.3586148

Zhuangzi Li;Shan Liu;Wei Gao;Guanbin Li;Ge Li

{"title":"S4R: Rethinking Point Cloud Sampling via Guiding Upsampling-Aware Perception","authors":"Zhuangzi Li;Shan Liu;Wei Gao;Guanbin Li;Ge Li","doi":"10.1109/TMM.2025.3586148","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586148","url":null,"abstract":"Point cloud sampling aims to derive a sparse point cloud from a relatively dense point cloud, which is essential for efficient data transmission and storage. While existing deep sampling methods prioritize preserving the perception of sampled point clouds for downstream networks, few studies have critically examined the rationale behind this goal. Specifically, we observe that sampling can lead to a perceptual degradation phenomenon in many influential downstream networks, impairing their ability to effectively process sampled point clouds. We theoretically reveal the nature of the phenomenon and attempt to construct a novel sampling target by uniting upsampling and perceptual reconstruction. Accordingly, we propose a Maximum A Posteriori (MAP) sampling framework named Sample for Reconstruct (S4R), which impels the sampling stage to infer upsampling-guided perception. In S4R, we design very simple but effective sampling and upsampling networks using residual-based graph convolutions and incorporate a pseudo-residual connection to introduce prior knowledge. This architecture takes advantage of reconstruction properties and allows the sampling network to be trained in an unsupervised manner. Extensive experiments on classical networks demonstrates the excellent performance of S4R compared with the previous sampling schemes and reveals its advantages on different point cloud downstream tasks, i.e., classification, reconstruction and segmentation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6677-6689"},"PeriodicalIF":9.7,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Synthesizing Multi-Person and Rare Pose Images for Human Pose Estimation 合成多人和稀有姿态图像用于人体姿态估计

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586122

Liuqing Zhao;Zichen Tian;Peng Zou;Richang Hong;Qianru Sun

{"title":"Synthesizing Multi-Person and Rare Pose Images for Human Pose Estimation","authors":"Liuqing Zhao;Zichen Tian;Peng Zou;Richang Hong;Qianru Sun","doi":"10.1109/TMM.2025.3586122","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586122","url":null,"abstract":"Human pose estimation (HPE) models underperform in recognizing rare poses because they suffer from data imbalance problems (i.e., there are few image samples for rare poses) in their training datasets. From a data perspective, the most intuitive solution is to synthesize data for rare poses. Specifically, the rule-based methods apply manual manipulations (such as Cutout and GridMask) to the existing data, so the limited diversity of the data constrains the model. An alternative method is to learn the underlying data distribution via deep generative models (such as ControlNet and HumanSD) and then sample “new data” from the distribution. This works well for generating frequent poses in common scenes, but suffers when applied to rare poses or complex scenes (such as multiple persons with overlapping limbs). In this paper, we aim to address the above two issues, i.e., rare poses and complex scenes, for person image generation. We propose a two-stage method. In the first stage, we design a controllable pose generator named PoseFactory to synthesize rare poses. This generator is specifically trained on augmented pose data, and each pose is labelled with its level of difficulty and rarity. In the second stage, we introduce a multi-person image generator named MultipGenerator. It is conditioned on multiple human poses and textual descriptions of complex scenes. Both stages are controllable in terms of the diversity of poses and the complexity of scenes. For evaluation, we conduct extensive experiments on three widely used datasets: MS-COCO, HumanArt, and OCHuman. We compare our method against traditional pose data augmentation and person image generation methods, and it demonstrates its superior performance both quantitatively and qualitatively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6568-6580"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Hierarchical Multi-Prototype Discrimination: Boosting Support-Query Matching for Few-Shot Segmentation 分层多原型判别：促进支持查询匹配的少镜头分割

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586125

Wenbo Xu;Huaxi Huang;Yongshun Gong;Litao Yu;Qiang Wu;Jian Zhang

{"title":"Hierarchical Multi-Prototype Discrimination: Boosting Support-Query Matching for Few-Shot Segmentation","authors":"Wenbo Xu;Huaxi Huang;Yongshun Gong;Litao Yu;Qiang Wu;Jian Zhang","doi":"10.1109/TMM.2025.3586125","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586125","url":null,"abstract":"Few-shot segmentation (FSS) aims at training a model on base classes with sufficient annotations and then tasking the model with predicting a binary mask to identify novel class pixels with limited labeled images. Mainstream FSS methods adopt a support-query matching paradigm that activates target regions of the query image according to their similarity with a single support class prototype. However, this prototype vector is inclined to overfit the support images, leading to potential under-matching in latent query object regions and incorrect mismatches with base class features in the query image. To address these issues, this study reformulates conventional single foreground prototype matching to a multi-prototype matching paradigm. In this paradigm, query features exhibiting high confidence with non-target prototypes will be categorized as background. Specifically, the target query features are drawn closer to the novel class prototype through a Masked Cross-Image Encoding (MCE) module and a Semantic Multi-prototype Matching (SMM) module is incorporated to collaboratively filter unexpected base class regions on multi-scale features. Furthermore, we devise an adaptive class activation map, termed target-aware class activation map (TCAM) to preserve semantically coherent regions that might be inadvertently suppressed under pixel-wise matching guidance. Experimental results on PASCAL-5<inline-formula><tex-math>$^{i}$</tex-math></inline-formula> and COCO-20<inline-formula><tex-math>$^{i}$</tex-math></inline-formula> datasets demonstrate the advantage of the proposed novel modules, with the holistic approach outperforming compared state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6705-6718"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation 连续视觉语言导航的记忆-观察协同系统

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586105

Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu

{"title":"MossVLN: Memory-Observation Synergistic System for Continuous Vision-Language Navigation","authors":"Ting Yu;Yifei Wu;Qiongjie Cui;Qingming Huang;Jun Yu","doi":"10.1109/TMM.2025.3586105","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586105","url":null,"abstract":"Navigating in continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integrations, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigational outcomes. Additionally, the capacity for effective navigation decisions is often hindered by the inadequacies of traditional topological maps and the issue of uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework to substantially enhance the navigation capabilities of agents operating in continuous environments. We present an advanced observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions within complex real-world scenarios. Additionally, we develop a strategic memory-observation planning approach that leverages memory panoramic environmental data and detailed current observation information, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% Success Rate (SR) and a 50.89% Path Length Score (SPL) on the R2R-CE dataset’s unseen validation splits. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6690-6704"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Pathology-Preserving Transformer Based on Multicolor Space for Low-Quality Medical Image Enhancement 基于多色空间的低质量医学图像增强病理保持变压器

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586133

Qingshan Hou;Yaqi Wang;Peng Cao;Jianguo Ju;Huijuan Tu;Xiaoli Liu;Jinzhu Yang;Huazhu Fu;Yih Chung Tham;Osmar R. Zaiane

{"title":"Pathology-Preserving Transformer Based on Multicolor Space for Low-Quality Medical Image Enhancement","authors":"Qingshan Hou;Yaqi Wang;Peng Cao;Jianguo Ju;Huijuan Tu;Xiaoli Liu;Jinzhu Yang;Huazhu Fu;Yih Chung Tham;Osmar R. Zaiane","doi":"10.1109/TMM.2025.3586133","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586133","url":null,"abstract":"Medical images acquired under suboptimal conditions often suffer from quality degradation, such as low-light, blurring, and artifacts. Such degradations obscure the lesions and anatomical structures in medical images, making it difficult to distinguish key pathological regions. This significantly increases the risk of misdiagnosis by automated medical diagnostic systems or clinicians. To address this challenge, we propose a multi-Color space-based quality enhancement network (MSQNet) that effectively eliminates global low-quality factors while preserving pathology-related characteristics for improved clinical observation and analysis. We first revisit the properties of image quality enhancement in different color spaces, where the V-channel in the HSV space can better represent the contrast and brightness enhancement process, whereas the A/B-channel in the LAB space is more focused on the color change of low-quality images. The proposed framework harnesses the unique properties of different color spaces to optimize the image enhancement process. Specifically, we propose a pathology-preserving transformer, designed to selectively aggregate features across different color spaces and enable comprehensive multiscale feature fusion. Leveraging these capabilities, MSQNet effectively enhances low-quality RGB medical images while preserving key pathological features, thereby establishing a new paradigm in medical image enhancement. Extensive experiments on three public medical image datasets demonstrate that MSQNet outperforms traditional enhancement techniques and state-of-the-art methods, in terms of both quantitative metrics and qualitative visual assessment. MSQNet successfully improves image quality while preserving pathological features and anatomical structures, facilitating accurate diagnosis and analysis by medical professionals and automated systems.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6661-6676"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition MVP-Shot：用于少射动作识别的多速度渐进式对齐框架

IF 9.7 1区计算机科学

IEEE Transactions on Multimedia Pub Date : 2025-07-04 DOI: 10.1109/TMM.2025.3586118

Hongyu Qu;Rui Yan;Xiangbo Shu;Hailiang Gao;Peng Huang;Guosen Xie

{"title":"MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition","authors":"Hongyu Qu;Rui Yan;Xiangbo Shu;Hailiang Gao;Peng Huang;Guosen Xie","doi":"10.1109/TMM.2025.3586118","DOIUrl":"https://doi.org/10.1109/TMM.2025.3586118","url":null,"abstract":"Recent few-shot action recognition (FSAR) methods typically perform semantic matching on learned discriminative features to achieve promising performance. However, most FSAR methods focus on single-scale (e.g., frame-level, segment-level, etc.) feature alignment, which ignores that human actions with the same semantic may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multi-velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos with different velocity scales and then merge all similarity scores in a residual fashion. To avoid the multiple velocity features deviating from the underlying motion semantic, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains at different velocities. The above two modules compensate for each other to make more accurate query sample predictions under the few-shot settings. Experimental results show our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (<italic>i.e., HMDB51, UCF101, Kinetics, SSv2-full, and SSv2-small).","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6593-6605"},"PeriodicalIF":9.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0