LoRA-CLIP: Efficient Low-Rank Adaptation of Large CLIP Foundation Model for Scene Classification
Mohamad Mahmoud Al Rahhal, Yakoub Bazi, Mansour Zuair
IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1-5, published 16 July 2025. DOI: 10.1109/LGRS.2025.3589738. Available: https://ieeexplore.ieee.org/document/11082270/
Scene classification in optical remote sensing (RS) imagery has been extensively investigated using both learning-from-scratch approaches and fine-tuning of ImageNet pretrained models. Meanwhile, contrastive language-image pretraining (CLIP) has emerged as a powerful foundation model for vision-language tasks, demonstrating remarkable zero-shot capabilities across various domains. Its image encoder is a key component in many vision instruction-tuning models, enabling effective alignment of text and visual modalities for diverse tasks. However, its potential for supervised RS scene classification remains unexplored. This work investigates the efficient adaptation of large CLIP models (containing over 300 M parameters) through low-rank adaptation (LoRA), specifically targeting the attention layers. By applying LoRA to CLIP’s attention mechanisms, we can effectively adapt the vision model for specialized scene classification tasks with minimal computational overhead, requiring fewer training epochs than traditional fine-tuning methods. Our extensive experiments demonstrate the promising capabilities of LoRA-CLIP. By training only on a small set of additional parameters, LoRA-CLIP outperforms models pretrained on ImageNet, demonstrating the clear advantages of using image–text pretrained backbones for scene classification.
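As a rough illustration of the adaptation strategy described in the abstract, the sketch below shows how a trainable low-rank update can be attached to the attention projection layers of a frozen vision encoder while the pretrained weights stay fixed. The `LoRALinear` wrapper, the rank and scaling values, and the `q_proj`/`v_proj` attribute names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained projection frozen
        # A starts small and random, B starts at zero, so training begins from the
        # unmodified pretrained behaviour.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path x -> A^T -> B^T, added to the frozen projection's output.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

def add_lora_to_attention(attn, r: int = 8):
    """Wrap the query and value projections of one attention block.
    Attribute names are hypothetical; adjust to the CLIP implementation used."""
    attn.q_proj = LoRALinear(attn.q_proj, r=r)
    attn.v_proj = LoRALinear(attn.v_proj, r=r)
```

In a setup like this, only the `lora_A`/`lora_B` tensors receive gradients, which is what keeps the number of trainable parameters small relative to a backbone with over 300 million parameters.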