arXiv - CS - Computer Vision and Pattern Recognition: Latest Papers, Page 3

RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11706

Jinrang Jia, Guangqi Yi, Yifeng Shi

Abstract: Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, due to the differences between roadside and vehicle-side scenarios, there is currently no multi-camera BEV solution for the roadside. This paper systematically analyzes the key challenges in multi-camera BEV perception for roadside scenarios compared to vehicle-side ones. These challenges include the diversity in camera poses, the uncertainty in camera numbers, the sparsity in perception regions, and the ambiguity in orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset that covers more than 50 intersections and 600 cameras.

Citations: 0

SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11869

Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao

Abstract: Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research, which predominantly employs 2D sensors to gather gait data, has achieved notable advancements; nonetheless, it has unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly captures 3D spatial features but also diminishes the impact of lighting conditions while ensuring privacy protection. The essence of the problem lies in how to effectively extract discriminative 3D dynamic representations from point clouds. In this paper, we propose a method named SpheriGait for extracting and enhancing dynamic features from point clouds for LiDAR-based gait recognition. Specifically, it substitutes the conventional point cloud plane projection method with spherical projection to augment the perception of dynamic features. Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. We conducted extensive experiments, and the results demonstrate that SpheriGait achieves state-of-the-art performance on the SUSTech1K dataset. They also verify that the spherical projection method can serve as a universal data preprocessing technique to enhance the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.

Citations: 0
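The SpheriGait abstract above replaces plane projection of the point cloud with spherical projection. As a rough illustration of what a spherical (range-image) projection generally looks like, here is a minimal sketch. It is not the authors' implementation: the function name `spherical_projection`, the image size, and the vertical field-of-view values are all assumptions chosen for the example.

```python
import math

def spherical_projection(points, height=64, width=64,
                         fov_up_deg=30.0, fov_down_deg=-30.0):
    """Project (x, y, z) points onto a height x width range image.

    Hypothetical helper for illustration; all parameters are assumed values.
    Each pixel stores the distance of the nearest point that falls into it
    (0.0 means no point landed there).
    """
    fov_up = math.radians(fov_up_deg)
    fov = fov_up - math.radians(fov_down_deg)  # total vertical field of view
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        depth = math.sqrt(x * x + y * y + z * z)
        if depth == 0.0:
            continue  # the sensor origin has no direction, skip it
        yaw = math.atan2(y, x)                                # azimuth in [-pi, pi]
        pitch = math.asin(max(-1.0, min(1.0, z / depth)))     # elevation angle
        # Map azimuth to a column and elevation to a row (top row = fov_up).
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((fov_up - pitch) / fov * height)
        # Points outside the field of view are clamped to the border pixels.
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        if image[row][col] == 0.0 or depth < image[row][col]:
            image[row][col] = depth
    return image
```

Keeping the nearest point per pixel is one common choice for resolving collisions; a real pipeline might instead store several channels (for example x, y, z, and intensity) per pixel.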

Latent fingerprint enhancement for accurate minutiae detection

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11802

Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak

Abstract: Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, the methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns, overlooking the fine-grained details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances. This leads to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations conducted on two publicly available datasets demonstrate our method's dominance over existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.

Citations: 0

SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12108

Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian

Abstract: Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer but is now widely used for various gastrointestinal lesions. Computer-assisted surgery systems have played a crucial role in improving the precision and safety of ESD procedures; however, their effectiveness is limited by the accuracy of surgical phase recognition. The intricate nature of ESD, with different lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal contexts in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models like Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate the processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.

Citations: 0

BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12014

Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger

Abstract: Understanding the anisotropic reflectance of complex Earth surfaces from satellite imagery is crucial for numerous applications. Neural radiance fields (NeRF) have become popular as a machine learning technique capable of deducing the bidirectional reflectance distribution function (BRDF) of a scene from multiple images. However, prior research has largely concentrated on applying NeRF to close-range imagery, estimating basic Microfacet BRDF models, which fall short for many Earth surfaces. Moreover, high-quality NeRFs generally require several images captured simultaneously, a rare occurrence in satellite imaging. To address these limitations, we propose BRDF-NeRF, developed to explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical BRDF model commonly employed in remote sensing. We assess our approach using two datasets: (1) Djibouti, captured in a single epoch at varying viewing angles with a fixed Sun position, and (2) Lanzhou, captured over multiple epochs with different viewing angles and Sun positions. Our results, based on only three to four satellite images for training, demonstrate that BRDF-NeRF can effectively synthesize novel views from directions far removed from the training data and produce high-quality digital surface models (DSMs).

Citations: 0

Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11718

Yuan Tian, Guo Lu, Guangtao Zhai

Abstract: Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from visual foundation models (VFMs). Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bit cost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.

Citations: 0

Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12099

Jaehoon Joo, Taejin Jeong, Seongjae Hwang

Abstract: Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.

Citations: 0

RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11749

Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui

Abstract: 3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces from a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.

Citations: 0

LEMON: Localized Editing with Mesh Optimization and Neural Shaders

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12024

Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach

Abstract: In practical use cases, polygonal mesh editing can be faster than generating new ones, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.

Citations: 0

Mixture of Prompt Learning for Vision Language Models

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12011

Yu Du, Tong Niu, Rong Zhao

Abstract: As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to combine VLMs for downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks, requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module is able to capture a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, thereby both retaining knowledge from hard prompts and improving selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text feature and the hard-prompt-encoded text feature. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating evident improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387

Citations: 0
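The abstract above describes a router that picks the most suitable soft prompts for each instance based on similarity to hard prompt templates. The sketch below shows one plausible reading of that idea as top-k routing with softmax weights. It is only an illustration under stated assumptions, not the paper's method: the names `route_prompts` and `mix_text_features`, the use of cosine similarity, and the `top_k` and `temperature` values are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route_prompts(image_feature, template_features, top_k=2, temperature=0.1):
    """Pick the top_k templates most similar to the image feature.

    Hypothetical router: returns (index, weight) pairs, with weights given by
    a softmax over the selected similarity scores.
    """
    scores = [cosine(image_feature, t) / temperature for t in template_features]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:top_k]
    peak = max(scores[i] for i in selected)  # subtract max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in selected]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(selected, exps)]

def mix_text_features(routes, soft_prompt_text_features):
    """Weighted average of the text features of the selected soft prompts."""
    dim = len(soft_prompt_text_features[0])
    mixed = [0.0] * dim
    for index, weight in routes:
        for d in range(dim):
            mixed[d] += weight * soft_prompt_text_features[index][d]
    return mixed
```

For example, `route_prompts([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` selects templates 0 and 2, and `mix_text_features` then blends the text features of the matching soft prompts with those weights.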