arXiv - CS - Computer Vision and Pattern Recognition最新文献

筛选
英文 中文
SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition SpheriGait:通过基于激光雷达的步态识别的球面投影丰富空间表示
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11869
Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao
{"title":"SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition","authors":"Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao","doi":"arxiv-2409.11869","DOIUrl":"https://doi.org/arxiv-2409.11869","url":null,"abstract":"Gait recognition is a rapidly progressing technique for the remote\u0000identification of individuals. Prior research predominantly employing 2D\u0000sensors to gather gait data has achieved notable advancements; nonetheless,\u0000they have unavoidably neglected the influence of 3D dynamic characteristics on\u0000recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly\u0000captures 3D spatial features but also diminishes the impact of lighting\u0000conditions while ensuring privacy protection.The essence of the problem lies in\u0000how to effectively extract discriminative 3D dynamic representation from point\u0000clouds.In this paper, we proposes a method named SpheriGait for extracting and\u0000enhancing dynamic features from point clouds for Lidar-based gait recognition.\u0000Specifically, it substitutes the conventional point cloud plane projection\u0000method with spherical projection to augment the perception of dynamic\u0000feature.Additionally, a network block named DAM-L is proposed to extract gait\u0000cues from the projected point cloud data. We conducted extensive experiments\u0000and the results demonstrated the SpheriGait achieved state-of-the-art\u0000performance on the SUSTech1K dataset, and verified that the spherical\u0000projection method can serve as a universal data preprocessing technique to\u0000enhance the performance of other LiDAR-based gait recognition methods,\u0000exhibiting exceptional flexibility and practicality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Precise Forecasting of Sky Images Using Spatial Warping 利用空间扭曲技术精确预报天空图像
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12162
Leron Julian, Aswin C. Sankaranarayanan
{"title":"Precise Forecasting of Sky Images Using Spatial Warping","authors":"Leron Julian, Aswin C. Sankaranarayanan","doi":"arxiv-2409.12162","DOIUrl":"https://doi.org/arxiv-2409.12162","url":null,"abstract":"The intermittency of solar power, due to occlusion from cloud cover, is one\u0000of the key factors inhibiting its widespread use in both commercial and\u0000residential settings. Hence, real-time forecasting of solar irradiance for\u0000grid-connected photovoltaic systems is necessary to schedule and allocate\u0000resources across the grid. Ground-based imagers that capture wide field-of-view\u0000images of the sky are commonly used to monitor cloud movement around a\u0000particular site in an effort to forecast solar irradiance. However, these wide\u0000FOV imagers capture a distorted image of sky image, where regions near the\u0000horizon are heavily compressed. This hinders the ability to precisely predict\u0000cloud motion near the horizon which especially affects prediction over longer\u0000time horizons. In this work, we combat the aforementioned constraint by\u0000introducing a deep learning method to predict a future sky image frame with\u0000higher resolution than previous methods. Our main contribution is to derive an\u0000optimal warping method to counter the adverse affects of clouds at the horizon,\u0000and learn a framework for future sky image prediction which better determines\u0000cloud evolution for longer time horizons.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Latent fingerprint enhancement for accurate minutiae detection 增强潜伏指纹,实现精确的细节检测
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11802
Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak
{"title":"Latent fingerprint enhancement for accurate minutiae detection","authors":"Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak","doi":"arxiv-2409.11802","DOIUrl":"https://doi.org/arxiv-2409.11802","url":null,"abstract":"Identification of suspects based on partial and smudged fingerprints,\u0000commonly referred to as fingermarks or latent fingerprints, presents a\u0000significant challenge in the field of fingerprint recognition. Although\u0000fixed-length embeddings have shown effectiveness in recognising rolled and slap\u0000fingerprints, the methods for matching latent fingerprints have primarily\u0000centred around local minutiae-based embeddings, failing to fully exploit global\u0000representations for matching purposes. Consequently, enhancing latent\u0000fingerprints becomes critical to ensuring robust identification for forensic\u0000investigations. Current approaches often prioritise restoring ridge patterns,\u0000overlooking the fine-macroeconomic details crucial for accurate fingerprint\u0000recognition. To address this, we propose a novel approach that uses generative\u0000adversary networks (GANs) to redefine Latent Fingerprint Enhancement (LFE)\u0000through a structured approach to fingerprint generation. By directly optimising\u0000the minutiae information during the generation process, the model produces\u0000enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth\u0000instances. This leads to a significant improvement in identification\u0000performance. Our framework integrates minutiae locations and orientation\u0000fields, ensuring the preservation of both local and structural fingerprint\u0000features. Extensive evaluations conducted on two publicly available datasets\u0000demonstrate our method's dominance over existing state-of-the-art techniques,\u0000highlighting its potential to significantly enhance latent fingerprint\u0000recognition accuracy in forensic applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba SPRMamba:使用 Mamba 进行内窥镜粘膜下剥离的手术阶段识别
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12108
Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian
{"title":"SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba","authors":"Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian","doi":"arxiv-2409.12108","DOIUrl":"https://doi.org/arxiv-2409.12108","url":null,"abstract":"Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure\u0000initially designed for the treatment of early gastric cancer but is now widely\u0000used for various gastrointestinal lesions. Computer-assisted Surgery systems\u0000have played a crucial role in improving the precision and safety of ESD\u0000procedures, however, their effectiveness is limited by the accurate recognition\u0000of surgical phases. The intricate nature of ESD, with different lesion\u0000characteristics and tissue structures, presents challenges for real-time\u0000surgical phase recognition algorithms. Existing surgical phase recognition\u0000algorithms struggle to efficiently capture temporal contexts in video-based\u0000scenarios, leading to insufficient performance. To address these issues, we\u0000propose SPRMamba, a novel Mamba-based framework for ESD surgical phase\u0000recognition. SPRMamba leverages the strengths of Mamba for long-term temporal\u0000modeling while introducing the Scaled Residual TranMamba block to enhance the\u0000capture of fine-grained details, overcoming the limitations of traditional\u0000temporal models like Temporal Convolutional Networks and Transformers.\u0000Moreover, a Temporal Sample Strategy is introduced to accelerate the\u0000processing, which is essential for real-time phase recognition in clinical\u0000settings. Extensive testing on the ESD385 dataset and the cholecystectomy\u0000Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art\u0000methods and exhibits greater robustness across various surgical phase\u0000recognition tasks.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling BRDF-NeRF:利用光学卫星图像和 BRDF 建模的神经辐射场
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12014
Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger
{"title":"BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling","authors":"Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger","doi":"arxiv-2409.12014","DOIUrl":"https://doi.org/arxiv-2409.12014","url":null,"abstract":"Understanding the anisotropic reflectance of complex Earth surfaces from\u0000satellite imagery is crucial for numerous applications. Neural radiance fields\u0000(NeRF) have become popular as a machine learning technique capable of deducing\u0000the bidirectional reflectance distribution function (BRDF) of a scene from\u0000multiple images. However, prior research has largely concentrated on applying\u0000NeRF to close-range imagery, estimating basic Microfacet BRDF models, which\u0000fall short for many Earth surfaces. Moreover, high-quality NeRFs generally\u0000require several images captured simultaneously, a rare occurrence in satellite\u0000imaging. To address these limitations, we propose BRDF-NeRF, developed to\u0000explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical\u0000BRDF model commonly employed in remote sensing. We assess our approach using\u0000two datasets: (1) Djibouti, captured in a single epoch at varying viewing\u0000angles with a fixed Sun position, and (2) Lanzhou, captured over multiple\u0000epochs with different viewing angles and Sun positions. Our results, based on\u0000only three to four satellite images for training, demonstrate that BRDF-NeRF\u0000can effectively synthesize novel views from directions far removed from the\u0000training data and produce high-quality digital surface models (DSMs).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression Free-VSC:来自视觉基础模型的自由语义,用于无监督视频语义压缩
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11718
Yuan Tian, Guo Lu, Guangtao Zhai
{"title":"Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression","authors":"Yuan Tian, Guo Lu, Guangtao Zhai","doi":"arxiv-2409.11718","DOIUrl":"https://doi.org/arxiv-2409.11718","url":null,"abstract":"Unsupervised video semantic compression (UVSC), i.e., compressing videos to\u0000better support various analysis tasks, has recently garnered attention.\u0000However, the semantic richness of previous methods remains limited, due to the\u0000single semantic learning objective, limited training data, etc. To address\u0000this, we propose to boost the UVSC task by absorbing the off-the-shelf rich\u0000semantics from VFMs. Specifically, we introduce a VFMs-shared semantic\u0000alignment layer, complemented by VFM-specific prompts, to flexibly align\u0000semantics between the compressed video and various VFMs. This allows different\u0000VFMs to collaboratively build a mutually-enhanced semantic space, guiding the\u0000learning of the compression model. Moreover, we introduce a dynamic\u0000trajectory-based inter-frame compression scheme, which first estimates the\u0000semantic trajectory based on the historical content, and then traverses along\u0000the trajectory to predict the future semantics as the coding context. This\u0000reduces the overall bitcost of the system, further improving the compression\u0000efficiency. Our approach outperforms previous coding methods on three\u0000mainstream tasks and six datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance 脑流:多模态引导下的 fMRI 图像重构
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12099
Jaehoon Joo, Taejin Jeong, Seongjae Hwang
{"title":"Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance","authors":"Jaehoon Joo, Taejin Jeong, Seongjae Hwang","doi":"arxiv-2409.12099","DOIUrl":"https://doi.org/arxiv-2409.12099","url":null,"abstract":"Understanding how humans process visual information is one of the crucial\u0000steps for unraveling the underlying mechanism of brain activity. Recently, this\u0000curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI\u0000data from visual stimuli, it aims to reconstruct the corresponding visual\u0000stimuli. Surprisingly, leveraging powerful generative models such as the Latent\u0000Diffusion Model (LDM) has shown promising results in reconstructing complex\u0000visual stimuli such as high-resolution natural images from vision datasets.\u0000Despite the impressive structural fidelity of these reconstructions, they often\u0000lack details of small objects, ambiguous shapes, and semantic nuances.\u0000Consequently, the incorporation of additional semantic knowledge, beyond mere\u0000visuals, becomes imperative. In light of this, we exploit how modern LDMs\u0000effectively incorporate multi-modal guidance (text guidance, visual guidance,\u0000and image layout) for structurally and semantically plausible image\u0000generations. Specifically, inspired by the two-streams hypothesis suggesting\u0000that perceptual and semantic information are processed in different brain\u0000regions, our framework, Brain-Streams, maps fMRI signals from these brain\u0000regions to appropriate embeddings. That is, by extracting textual guidance from\u0000semantic information regions and visual guidance from perceptual information\u0000regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\u0000validate the reconstruction ability of Brain-Streams both quantitatively and\u0000qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\u0000data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework RockTrack:3D Robust Multi-Camera-Ken 多目标跟踪框架
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.11749
Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui
{"title":"RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework","authors":"Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui","doi":"arxiv-2409.11749","DOIUrl":"https://doi.org/arxiv-2409.11749","url":null,"abstract":"3D Multi-Object Tracking (MOT) obtains significant performance improvements\u0000with the rapid advancements in 3D object detection, particularly in\u0000cost-effective multi-camera setups. However, the prevalent end-to-end training\u0000approach for multi-camera trackers results in detector-specific models,\u0000limiting their versatility. Moreover, current generic trackers overlook the\u0000unique features of multi-camera detectors, i.e., the unreliability of motion\u0000observations and the feasibility of visual information. To address these\u0000challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors.\u0000Following the Tracking-By-Detection framework, RockTrack is compatible with\u0000various off-the-shelf detectors. RockTrack incorporates a confidence-guided\u0000preprocessing module to extract reliable motion and image observations from\u0000distinct representation spaces from a single detector. These observations are\u0000then fused in an association module that leverages geometric and appearance\u0000cues to minimize mismatches. The resulting matches are propagated through a\u0000staged estimation process, forming the basis for heuristic noise modeling.\u0000Additionally, we introduce a novel appearance similarity metric for explicitly\u0000characterizing object affinities in multi-camera settings. RockTrack achieves\u0000state-of-the-art performance on the nuScenes vision-only tracking leaderboard\u0000with 59.1% AMOTA while demonstrating impressive computational efficiency.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
LEMON: Localized Editing with Mesh Optimization and Neural Shaders LEMON:利用网格优化和神经着色器进行局部编辑
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12024
Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach
{"title":"LEMON: Localized Editing with Mesh Optimization and Neural Shaders","authors":"Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach","doi":"arxiv-2409.12024","DOIUrl":"https://doi.org/arxiv-2409.12024","url":null,"abstract":"In practical use cases, polygonal mesh editing can be faster than generating\u0000new ones, but it can still be challenging and time-consuming for users.\u0000Existing solutions for this problem tend to focus on a single task, either\u0000geometry or novel view synthesis, which often leads to disjointed results\u0000between the mesh and view. In this work, we propose LEMON, a mesh editing\u0000pipeline that combines neural deferred shading with localized mesh\u0000optimization. Our approach begins by identifying the most important vertices in\u0000the mesh for editing, utilizing a segmentation model to focus on these key\u0000regions. Given multi-view images of an object, we optimize a neural shader and\u0000a polygonal mesh while extracting the normal map and the rendered image from\u0000each view. By using these outputs as conditioning data, we edit the input\u0000images with a text-to-image diffusion model and iteratively update our dataset\u0000while deforming the mesh. This process results in a polygonal mesh that is\u0000edited according to the given text instruction, preserving the geometric\u0000characteristics of the initial mesh while focusing on the most significant\u0000areas. We evaluate our pipeline using the DTU dataset, demonstrating that it\u0000generates finely-edited meshes more rapidly than the current state-of-the-art\u0000methods. We include our code and additional results in the supplementary\u0000material.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Mixture of Prompt Learning for Vision Language Models 视觉语言模型的混合提示学习
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-18 DOI: arxiv-2409.12011
Yu Du, Tong Niu, Rong Zhao
{"title":"Mixture of Prompt Learning for Vision Language Models","authors":"Yu Du, Tong Niu, Rong Zhao","doi":"arxiv-2409.12011","DOIUrl":"https://doi.org/arxiv-2409.12011","url":null,"abstract":"As powerful pre-trained vision-language models (VLMs) like CLIP gain\u0000prominence, numerous studies have attempted to combine VLMs for downstream\u0000tasks. Among these, prompt learning has been validated as an effective method\u0000for adapting to new tasks, which only requiring a small number of parameters.\u0000However, current prompt learning methods face two challenges: first, a single\u0000soft prompt struggles to capture the diverse styles and patterns within a\u0000dataset; second, fine-tuning soft prompts is prone to overfitting. To address\u0000these challenges, we propose a mixture of soft prompt learning method\u0000incorporating a routing module. This module is able to capture a dataset's\u0000varied styles and dynamically selects the most suitable prompts for each\u0000instance. Additionally, we introduce a novel gating mechanism to ensure the\u0000router selects prompts based on their similarity to hard prompt templates,\u0000which both retaining knowledge from hard prompts and improving selection\u0000accuracy. We also implement semantically grouped text-level supervision,\u0000initializing each soft prompt with the token embeddings of manually designed\u0000templates from its group and applied a contrastive loss between the resulted\u0000text feature and hard prompt encoded text feature. This supervision ensures\u0000that the text features derived from soft prompts remain close to those from\u0000their corresponding hard prompts, preserving initial knowledge and mitigating\u0000overfitting. Our method has been validated on 11 datasets, demonstrating\u0000evident improvements in few-shot learning, domain generalization, and\u0000base-to-new generalization scenarios compared to existing baselines. The code\u0000will be available at url{https://anonymous.4open.science/r/mocoop-6387}","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信