IEEE Transactions on Circuits and Systems for Video Technology - Latest Articles

SHAA: Spatial Hybrid Attention Network With Adaptive Cross-Entropy Loss Function for UAV-View Geo-Localization
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-15 DOI: 10.1109/TCSVT.2025.3560637
Nanhua Chen;Dongshuo Zhang;Kai Jiang;Meng Yu;Yeqing Zhu;Tai-Shan Lou;Liangyu Zhao
{"title":"SHAA: Spatial Hybrid Attention Network With Adaptive Cross-Entropy Loss Function for UAV-View Geo-Localization","authors":"Nanhua Chen;Dongshuo Zhang;Kai Jiang;Meng Yu;Yeqing Zhu;Tai-Shan Lou;Liangyu Zhao","doi":"10.1109/TCSVT.2025.3560637","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3560637","url":null,"abstract":"Cross-view geo-localization provides an offline visual positioning strategy for unmanned aerial vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, it still faces the following challenges, leading to suboptimal localization performance: 1) Existing methods primarily focus on extracting global features or local features by partitioning feature maps, neglecting the exploration of spatial information, which is essential for extracting consistent feature representations and aligning images of identical targets across different views. 2) Cross-view geo-localization encounters the challenge of data imbalance between UAV and satellite images. To address these challenges, the Spatial Hybrid Attention Network with Adaptive Cross-Entropy Loss Function (SHAA) is proposed. To tackle the first issue, the Spatial Hybrid Attention (SHA) method employs a Spatial Shift-MLP (SSM) to focus on the spatial geometric correspondences in feature maps across different views, extracting both global features and fine-grained features. Additionally, the SHA method utilizes a Hybrid Attention (HA) mechanism to enhance feature extraction diversity and robustness by capturing interactions between spatial and channel dimensions, thereby extracting consistent cross-view features and aligning images. For the second challenge, the Adaptive Cross-Entropy (ACE) loss function incorporates adaptive weights to emphasize hard samples, alleviating data imbalance issues and improving training effectiveness. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and DenseUAV, demonstrate that SHAA achieves state-of-the-art performance, outperforming existing methods by over 3.92%. Code will be released at: <uri>https://github.com/chennanhua001/SHAA</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9398-9413"},"PeriodicalIF":11.1,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
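The Adaptive Cross-Entropy (ACE) loss described in the abstract above re-weights hard samples to counter the UAV/satellite data imbalance. The paper's exact weighting scheme is not given in the abstract; the sketch below is only a minimal focal-style illustration of the idea, where a sample's weight grows as its true-class probability drops (the exponent `gamma` and the optional class weights `alpha` are assumptions, not values from the paper).

```python
import torch.nn.functional as F

def adaptive_cross_entropy(logits, targets, gamma=2.0, alpha=None):
    """Cross-entropy with adaptive per-sample weights that emphasize hard samples.

    Minimal focal-style sketch: `gamma` and `alpha` are illustrative choices only,
    not the weighting used by the SHAA paper.
    """
    log_probs = F.log_softmax(logits, dim=-1)                           # (N, C)
    true_log_p = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # (N,)
    weights = (1.0 - true_log_p.exp()) ** gamma   # harder sample -> larger weight
    if alpha is not None:                         # optional per-class balancing term
        weights = weights * alpha[targets]
    return -(weights * true_log_p).mean()

# Usage sketch: loss = adaptive_cross_entropy(model(uav_batch), labels)
```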
Keypoints and Action Units Jointly Drive Talking Head Generation for Video Conferencing
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-14 DOI: 10.1109/TCSVT.2025.3560369
Wuzhen Shi;Zibang Xue;Yang Wen
{"title":"Keypoints and Action Units Jointly Drive Talking Head Generation for Video Conferencing","authors":"Wuzhen Shi;Zibang Xue;Yang Wen","doi":"10.1109/TCSVT.2025.3560369","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3560369","url":null,"abstract":"This paper introduces a high-quality talking head generation method that is jointly driven by keypoints and action units, aiming to strike a balance between low-bandwidth transmission and high-quality generation in video conference scenarios. Existing methods for talking head generation often face limitations: they either require an excessive amount of driving information or struggle with accuracy and quality when adapted to low-bandwidth conditions. To address this, we decompose the talking head generation task into two components: a driving task, focused on information-limited control, and an enhancement task, aimed at achieving high-quality, high-definition output. Our proposed method innovatively incorporates the joint driving of keypoints and action units, improving the accuracy of pose and expression generation while remaining suitable for low-bandwidth environments. Furthermore, we implement a multi-step video quality enhancement process, targeting both the entire frame and key regions, while incorporating temporal consistency constraints. By leveraging attention mechanisms, we enhance the realism of the challenging-to-generate mouth regions and mitigate background jitter through background fusion. Finally, a prior-driven super-resolution network is employed to achieve high-quality display. Extensive experiments demonstrate that our method effectively supports low-resolution recording, low-bandwidth transmission, and high-definition display.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8692-8706"},"PeriodicalIF":11.1,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing Visible-Infrared Person Re-Identification With Modality- and Instance-Aware Adaptation Learning
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-11 DOI: 10.1109/TCSVT.2025.3560118
Ruiqi Wu;Bingliang Jiao;Meng Liu;Shining Wang;Wenxuan Wang;Peng Wang
{"title":"Enhancing Visible-Infrared Person Re-Identification With Modality- and Instance-Aware Adaptation Learning","authors":"Ruiqi Wu;Bingliang Jiao;Meng Liu;Shining Wang;Wenxuan Wang;Peng Wang","doi":"10.1109/TCSVT.2025.3560118","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3560118","url":null,"abstract":"The Visible-Infrared Person Re-identification (VI ReID) aims to achieve cross-modality re-identification by matching pedestrian images from visible and infrared illumination. A crucial challenge in this task is mitigating the impact of modality divergence to enable the VI ReID model to learn cross-modality correspondence. Regarding this challenge, existing methods primarily focus on eliminating the information gap between different modalities by extracting modality-invariant information or supplementing inputs with specific information from another modality. However, these methods may overly focus on bridging the information gap, a challenging issue that could potentially overshadow the inherent complexities of cross-modality ReID itself. Based on this insight, we propose a straightforward yet effective strategy to empower the VI ReID model with sufficient flexibility to adapt diverse modality inputs to achieve cross-modality ReID effectively. Specifically, we introduce a Modality-aware and Instance-aware Visual Prompts (MIP) network, leveraging transformer architecture with customized visual prompts. In our MIP, a set of modality-aware prompts is designed to enable our model to dynamically adapt diverse modality inputs and effectively extract information for identification, thereby alleviating the interference of modality divergence. Besides, we also propose the instance-aware prompts, which are responsible for guiding the model to adapt individual pedestrians and capture discriminative clues for accurate identification. Through extensive experiments on four mainstream VI ReID datasets, the effectiveness of our designed modules is evaluated. Furthermore, our proposed MIP network outperforms most current state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 8","pages":"8086-8103"},"PeriodicalIF":11.1,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144781983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
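The core mechanism in the abstract above is a set of modality-aware visual prompts that let one transformer adapt to visible or infrared inputs. The MIP network itself is not reproduced here; the block below is a minimal sketch of the general prompt-token idea under a ViT-style token layout, and the prompt length, embedding size, and modality encoding (0 = visible, 1 = infrared) are assumptions.

```python
import torch
import torch.nn as nn

class ModalityAwarePrompts(nn.Module):
    """Prepend learnable, modality-specific prompt tokens to a patch-token sequence.

    Illustrative sketch only; shapes and the two-modality encoding are assumptions.
    """
    def __init__(self, embed_dim=768, num_prompts=4, num_modalities=2):
        super().__init__()
        self.prompts = nn.Parameter(
            torch.randn(num_modalities, num_prompts, embed_dim) * 0.02)

    def forward(self, tokens, modality):
        # tokens: (B, L, D) patch embeddings; modality: (B,) tensor of modality ids.
        prompt = self.prompts[modality]             # (B, num_prompts, D)
        return torch.cat([prompt, tokens], dim=1)   # (B, num_prompts + L, D)

# Usage sketch (names assumed): encoder(ModalityAwarePrompts()(patch_tokens, modality_ids))
```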
Spatial-Aware Conformal Prediction for Trustworthy Hyperspectral Image Classification
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-10 DOI: 10.1109/TCSVT.2025.3558753
Kangdao Liu;Tianhao Sun;Hao Zeng;Yongshan Zhang;Chi-Man Pun;Chi-Man Vong
{"title":"Spatial-Aware Conformal Prediction for Trustworthy Hyperspectral Image Classification","authors":"Kangdao Liu;Tianhao Sun;Hao Zeng;Yongshan Zhang;Chi-Man Pun;Chi-Man Vong","doi":"10.1109/TCSVT.2025.3558753","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558753","url":null,"abstract":"Hyperspectral image (HSI) classification involves assigning unique labels to each pixel to identify various land cover categories. While deep classifiers have achieved high predictive accuracy in this field, they lack the ability to rigorously quantify confidence in their predictions. This limitation restricts their application in critical contexts where the cost of prediction errors is significant, as quantifying the uncertainty of model predictions is crucial for the safe deployment of predictive models. To address this limitation, a rigorous theoretical proof is presented first, which demonstrates the validity of Conformal Prediction, an emerging uncertainty quantification technique, in the context of HSI classification. Building on this foundation, a conformal procedure is designed to equip any pre-trained HSI classifier with trustworthy prediction sets, ensuring that the true labels are included with a user-defined probability (e.g., 95%). Furthermore, a novel framework of Conformal Prediction specifically designed for HSI data, called Spatial-Aware Conformal Prediction ( <monospace>SACP</monospace> ), is proposed. This framework integrates essential spatial information of HSI by aggregating the non-conformity scores of pixels with high spatial correlation, effectively improving the statistical efficiency of prediction sets. Both theoretical and empirical results validate the effectiveness of the proposed approaches. The source code is available at <uri>https://github.com/J4ckLiu/SACP</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8754-8766"},"PeriodicalIF":11.1,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
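SACP builds on split conformal prediction: a non-conformity threshold is calibrated on held-out labeled pixels so the resulting prediction sets cover the true class with a user-chosen probability (e.g., 95%). The sketch below shows the vanilla, non-spatial procedure with the 1 - p(true class) score; SACP's spatial aggregation of non-conformity scores is not reproduced, and the variable names are placeholders.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Split conformal prediction with the 1 - p(true class) non-conformity score.

    Returns, for each test sample, the set of class indices whose score falls
    below a calibrated threshold, giving ~(1 - alpha) marginal coverage when the
    calibration and test data are exchangeable.
    """
    n = len(cal_labels)
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]     # calibration non-conformity
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)     # finite-sample correction
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]

# Usage sketch: sets = conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.05)
```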
VLF-SAR: A Novel Vision-Language Framework for Few-Shot SAR Target Recognition
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-10 DOI: 10.1109/TCSVT.2025.3558801
Nishang Xie;Tao Zhang;Lanyu Zhang;Jie Chen;Feiming Wei;Wenxian Yu
{"title":"VLF-SAR: A Novel Vision-Language Framework for Few-Shot SAR Target Recognition","authors":"Nishang Xie;Tao Zhang;Lanyu Zhang;Jie Chen;Feiming Wei;Wenxian Yu","doi":"10.1109/TCSVT.2025.3558801","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558801","url":null,"abstract":"Due to the challenges of obtaining data from valuable targets, few-shot learning plays a critical role in synthetic aperture radar (SAR) target recognition. However, the high noise levels and complex backgrounds inherent in SAR data make this technology difficult to implement. To improve the recognition accuracy, in this paper, we propose a novel vision-language framework, VLF-SAR, with two specialized models: VLF-SAR-P for polarimetric SAR (PolSAR) data and VLF-SAR-T for traditional SAR data. Both models start with a frequency embedded module (FEM) to generate key structural features. For VLF-SAR-P, a polarimetric feature selector (PFS) is further introduced to identify the most relevant polarimetric features. Also, a novel adaptive multimodal triple attention mechanism (AMTAM) is designed to facilitate dynamic interactions between different kinds of features. For VLF-SAR-T, after FEM, a multimodal fusion attention mechanism (MFAM) is correspondingly proposed to fuse and adapt information extracted from frozen contrastive language-image pre-training (CLIP) encoders across different modalities. Extensive experiments on the OpenSARShip2.0, FUSAR-Ship, and SAR-AirCraft-1.0 datasets demonstrate the superiority of VLF-SAR over some state-of-the-art methods, offering a promising approach for few-shot SAR target recognition.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9530-9544"},"PeriodicalIF":11.1,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Prior Knowledge-Driven Hybrid Prompter Learning for RGB-Event Tracking
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-10 DOI: 10.1109/TCSVT.2025.3559614
Mianzhao Wang;Fan Shi;Xu Cheng;Shengyong Chen
{"title":"Prior Knowledge-Driven Hybrid Prompter Learning for RGB-Event Tracking","authors":"Mianzhao Wang;Fan Shi;Xu Cheng;Shengyong Chen","doi":"10.1109/TCSVT.2025.3559614","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559614","url":null,"abstract":"Event data can asynchronously capture variations in light intensity, thereby implicitly providing valuable complementary cues for RGB-Event tracking. Existing methods typically employ a direct interaction mechanism to fuse RGB and event data. However, due to differences in imaging mechanisms, the representational disparity between these two data types is not fixed, which can lead to tracking failures in certain challenging scenarios. To address this issue, we propose a novel prior knowledge-driven hybrid prompter learning framework for RGB-Event tracking. Specifically, we develop a frame-event hybrid prompter that leverages prior tracking knowledge from the foundation model as intermediate modal support to mitigate the heterogeneity between RGB and event data. By leveraging its rich prior tracking knowledge, the intermediate modal reduces the gap between the dense RGB and sparse event data interactions, effectively guiding complementary learning between modalities. Meanwhile, to mitigate the internal learning disparities between the lightweight hybrid prompter and the deep transformer model, we introduce a pseudo-prompt learning strategy that lies between full fine-tuning and partial fine-tuning. This strategy adopts a divide-and-conquer approach to assign different learning rates to modules with distinct functions, effectively reducing the dominant influence of RGB information in complex scenarios. Extensive experiments conducted on two public RGB-Event tracking datasets show that the proposed HPL outperforms state-of-the-art tracking methods, achieving exceptional performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8679-8691"},"PeriodicalIF":11.1,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
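The pseudo-prompt learning strategy in the abstract above assigns different learning rates to modules with distinct functions. A standard way to express that in PyTorch is optimizer parameter groups, sketched below with stand-in modules; the module names, dimensions, and learning rates are placeholders, not the configuration used by HPL.

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these would be the lightweight hybrid prompter,
# the deep transformer backbone, and the tracking head.
model = nn.ModuleDict({
    "prompter": nn.Linear(256, 256),
    "backbone": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2),
    "head": nn.Linear(256, 4),
})

# Divide-and-conquer learning rates: one optimizer group per functional module,
# keeping the backbone close to frozen while the prompter adapts quickly.
optimizer = torch.optim.AdamW([
    {"params": model["prompter"].parameters(), "lr": 1e-3},
    {"params": model["backbone"].parameters(), "lr": 1e-5},
    {"params": model["head"].parameters(),     "lr": 1e-4},
], weight_decay=1e-4)
```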
Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-09 DOI: 10.1109/TCSVT.2025.3558785
Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma
{"title":"Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition","authors":"Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma","doi":"10.1109/TCSVT.2025.3558785","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558785","url":null,"abstract":"Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9246-9260"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
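MACES, described above, clusters unlabeled videos with a video-text ensemble distance metric. The exact metric is not given in the abstract; the sketch below shows one plausible form, mixing cosine distances computed on CLIP visual embeddings and on text embeddings of the generated captions (the mixing weight `beta` is an assumption).

```python
import torch.nn.functional as F

def ensemble_distance(video_emb, text_emb, beta=0.5):
    """Pairwise ensemble distance over two modalities (illustrative sketch).

    video_emb, text_emb: (N, D) embeddings of the same N videos from the CLIP
    visual encoder and from generated captions through the CLIP text encoder.
    Returns an (N, N) matrix mixing the two cosine distances with weight `beta`.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    d_video = 1.0 - v @ v.T
    d_text = 1.0 - t @ t.T
    return beta * d_video + (1.0 - beta) * d_text

# The resulting matrix can feed any distance-based clustering (e.g., k-medoids)
# to form pseudo-classes for episodic sampling.
```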
Self-BSR: Self-Supervised Image Denoising and Destriping Based on Blind-Spot Regularization
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-09 DOI: 10.1109/TCSVT.2025.3559214
Chao Qu;Zewei Chen;Jingyuan Zhang;Xiaoyu Chen;Jing Han
{"title":"Self-BSR: Self-Supervised Image Denoising and Destriping Based on Blind-Spot Regularization","authors":"Chao Qu;Zewei Chen;Jingyuan Zhang;Xiaoyu Chen;Jing Han","doi":"10.1109/TCSVT.2025.3559214","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559214","url":null,"abstract":"Digital images captured by unstable imaging systems often simultaneously suffer from random noise and stripe noise. Due to the complex noise distribution, denoising and destriping methods based on simple handcrafted priors may leave residual noise. Although supervised methods have achieved some progress, they rely on large-scale noisy-clean image pairs, which are challenging to obtain in practice. To address these problems, we propose a self-supervised image denoising and destriping method based on blind-spot regularization, named Self-BSR. This method transforms the overall denoising and destriping problem into a modeling task for two spatially correlated signals: image and stripe. Specifically, blind-spot regularization leverages spatial continuity learned by the improved blind-spot network to separately constrain the reconstruction of image and stripe while suppressing pixel-wise independent noise. This regularization has two advantages: first, it is adaptively formulated based on implicit network priors, without any explicit parametric modeling of image and noise; second, it enables Self-BSR to learn denoising and destriping only from noisy images. In addition, we introduce the directional feature unshuffle in Self-BSR, which extracts multi-directional information to provide discriminative features for separating image from stripe. Furthermore, the feature-resampling refinement is proposed to improve the reconstruction ability of Self-BSR by resampling pixels with high spatial correlation in the receptive field. Extensive experiments on synthetic and real-world datasets demonstrate significant advantages of the proposed method over existing methods in denoising and destriping performance. The code will be publicly available at <uri>https://github.com/Jocobqc/Self-BSR</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8666-8678"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
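A blind-spot network predicts each pixel from its neighborhood while excluding the pixel itself, which is what lets the model above be trained on noisy images alone when the noise is pixel-wise independent. Self-BSR's improved blind-spot architecture and the directional feature unshuffle are not reproduced here; the block below is only the classic centrally-masked convolution that blind-spot designs are commonly built from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentrallyMaskedConv2d(nn.Conv2d):
    """Convolution whose kernel center is zeroed, so the output at a pixel never
    sees that pixel's own (noisy) value; this is the basic blind-spot building block."""
    def __init__(self, in_ch, out_ch, kernel_size=3, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2, **kwargs)
        mask = torch.ones_like(self.weight)
        mask[:, :, kernel_size // 2, kernel_size // 2] = 0.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Re-apply the mask on every call so the center tap stays zero after weight updates.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        stride=self.stride, padding=self.padding,
                        dilation=self.dilation, groups=self.groups)

# A full blind-spot network combines masked/shifted convolutions that preserve the blind
# spot through depth and is trained with a reconstruction loss against the noisy input
# itself; no clean targets are required.
```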
EDTformer: An Efficient Decoder Transformer for Visual Place Recognition
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-09 DOI: 10.1109/TCSVT.2025.3559084
Tong Jin;Feng Lu;Shuyu Hu;Chun Yuan;Yunpeng Liu
{"title":"EDTformer: An Efficient Decoder Transformer for Visual Place Recognition","authors":"Tong Jin;Feng Lu;Shuyu Hu;Chun Yuan;Yunpeng Liu","doi":"10.1109/TCSVT.2025.3559084","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559084","url":null,"abstract":"Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches typically focus on the aggregation of deep features extracted from a backbone through using current prominent architectures (e.g., CNNs, MLPs, pooling layer, and transformer encoder), giving little attention to the transformer decoder. However, we argue that its strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly produce robust and discriminative global representations. Specifically, we do this by formulating deep features as the keys and values, as well as a set of learnable parameters as the queries. Our EDTformer can fully utilize the contextual information within deep features, then gradually decode and aggregate the effective features into the learnable queries to output the global representations. Moreover, to provide more powerful deep features for EDTformer and further facilitate the robustness, we use the foundation model DINOv2 as the backbone and propose a Low-rank Parallel Adaptation (LoPA) method to enhance its performance in VPR, which can refine the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods which add a re-ranking with considerable cost. Code will be available at <uri>https://github.com/Tong-Jin01/EDTformer</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8835-8848"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
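EDTformer aggregates backbone features into a global descriptor by letting a small set of learnable queries cross-attend to the deep features in stacked decoder blocks. The block below is only a minimal single-layer illustration of that query-based aggregation using standard PyTorch attention; the simplified decoder design, DINOv2 backbone, and LoPA adaptation from the paper are not reproduced, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAggregator(nn.Module):
    """Aggregate deep features into a global descriptor with learnable queries.

    Minimal sketch: one cross-attention layer plus a linear projection; the number
    of queries and the output dimension are illustrative assumptions.
    """
    def __init__(self, feat_dim=768, num_queries=32, out_dim=4096, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(num_queries * feat_dim, out_dim)

    def forward(self, feats):
        # feats: (B, N, D) deep features from the backbone, used as keys and values.
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)  # (B, Q, D)
        out, _ = self.cross_attn(q, feats, feats)                    # queries attend to features
        return F.normalize(self.proj(out.flatten(1)), dim=-1)        # (B, out_dim) descriptor

# Usage sketch: descriptor = QueryAggregator()(backbone_tokens)
```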
Efficient Single-Server Private Inference Outsourcing for Convolutional Neural Networks
IF 11.1 | Q1 | Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-04-09 DOI: 10.1109/TCSVT.2025.3559101
Xuanang Yang;Jing Chen;Yuqing Li;Kun He;Xiaojie Huang;Zikuan Jiang;Ruiying Du;Hao Bai
{"title":"Efficient Single-Server Private Inference Outsourcing for Convolutional Neural Networks","authors":"Xuanang Yang;Jing Chen;Yuqing Li;Kun He;Xiaojie Huang;Zikuan Jiang;Ruiying Du;Hao Bai","doi":"10.1109/TCSVT.2025.3559101","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559101","url":null,"abstract":"Private inference outsourcing ensures the privacy of both clients and model owners when model owners deliver inference services to clients through third-party cloud servers. Existing solutions either reduce inference accuracy due to model approximations or rely on the unrealistic assumption of non-colluding servers. Moreover, their efficiency falls short of HELiKs, a solution focused solely on client privacy protection. In this paper, we propose Skybolt, a single-server private inference outsourcing framework without resorting to model approximations, achieving greater efficiency than HELiKs. Skybolt is built upon efficient secure two-party computation protocols that safeguard the privacy of both clients and model owners. For the linear calculation protocol, we devise a ciphertext packing algorithm for homomorphic matrix multiplication, effectively reducing both computational and communication overheads. Additionally, our nonlinear calculation protocol features a lightweight online phase, involving only the addition and multiplication on secret shares. This stands in contrast to existing protocols, which entail resource-intensive techniques such as oblivious transfer. Extensive experiments on popular models, including ResNet50 and DenseNet121, show that Skybolt achieves a <inline-formula> <tex-math>$5.4-7.3 times $ </tex-math></inline-formula> reduction in inference latency, accompanied by a <inline-formula> <tex-math>$20.1-39.6 times $ </tex-math></inline-formula> decrease in communication cost compared to HELiKs.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 10","pages":"10586-10598"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
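The online phase of the nonlinear protocol is described above as only additions and multiplications on secret shares. The sketch below shows the textbook two-party construction over the ring Z_{2^64}: additive sharing, local addition, and multiplication via a Beaver triple. It is not Skybolt's actual protocol, and the triple is generated in the clear here purely for illustration (a real deployment produces it in an offline phase).

```python
import secrets

MOD = 1 << 64  # arithmetic over the ring Z_{2^64}

def share(x):
    """Split x into two additive shares that sum to x modulo 2^64."""
    r = secrets.randbelow(MOD)
    return r, (x - r) % MOD

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

def add_shares(a, b):
    """Each party adds its own shares locally; no communication is needed."""
    return (a[0] + b[0]) % MOD, (a[1] + b[1]) % MOD

def beaver_multiply(a, b):
    """Multiply two shared values using a Beaver triple (x, y, z = x*y).

    For illustration the triple is generated in the clear; in a real protocol it
    comes from an offline phase. Parties open e = a - x and f = b - y, then each
    computes its share of a*b = z + e*y + f*x + e*f (the e*f term added once).
    """
    x, y = secrets.randbelow(MOD), secrets.randbelow(MOD)
    xs, ys, zs = share(x), share(y), share((x * y) % MOD)
    e = (a[0] - xs[0] + a[1] - xs[1]) % MOD   # parties open a_i - x_i and sum; reveals nothing about a
    f = (b[0] - ys[0] + b[1] - ys[1]) % MOD
    c0 = (zs[0] + e * ys[0] + f * xs[0] + e * f) % MOD
    c1 = (zs[1] + e * ys[1] + f * xs[1]) % MOD
    return c0, c1

# Sanity check: reconstruct(*beaver_multiply(share(7), share(6))) == 42
```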