LoRA-CLIP: Efficient Low-Rank Adaptation of Large CLIP Foundation Model for Scene Classification
Mohamad Mahmoud Al Rahhal, Yakoub Bazi, Mansour Zuair
IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1-5, published 16 July 2025. DOI: 10.1109/LGRS.2025.3589738. Available: https://ieeexplore.ieee.org/document/11082270/
Scene classification in optical remote sensing (RS) imagery has been extensively investigated using both learning-from-scratch approaches and fine-tuning of ImageNet pretrained models. Meanwhile, contrastive language-image pretraining (CLIP) has emerged as a powerful foundation model for vision-language tasks, demonstrating remarkable zero-shot capabilities across various domains. Its image encoder is a key component in many vision instruction-tuning models, enabling effective alignment of text and visual modalities for diverse tasks. However, its potential for supervised RS scene classification remains unexplored. This work investigates the efficient adaptation of large CLIP models (containing over 300 M parameters) through low-rank adaptation (LoRA), specifically targeting the attention layers. By applying LoRA to CLIP’s attention mechanisms, we can effectively adapt the vision model for specialized scene classification tasks with minimal computational overhead, requiring fewer training epochs than traditional fine-tuning methods. Our extensive experiments demonstrate the promising capabilities of LoRA-CLIP. By training only on a small set of additional parameters, LoRA-CLIP outperforms models pretrained on ImageNet, demonstrating the clear advantages of using image–text pretrained backbones for scene classification.
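As a rough illustration of the adaptation strategy described in the abstract, the sketch below shows how a trainable low-rank update can be attached to the attention projection layers of a frozen vision encoder while the pretrained weights stay fixed. The `LoRALinear` wrapper, the rank and scaling values, and the `q_proj`/`v_proj` attribute names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained projection frozen
        # A starts small and random, B starts at zero, so training begins from the
        # unmodified pretrained behaviour.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path x -> A^T -> B^T, added to the frozen projection's output.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

def add_lora_to_attention(attn, r: int = 8):
    """Wrap the query and value projections of one attention block.
    Attribute names are hypothetical; adjust to the CLIP implementation used."""
    attn.q_proj = LoRALinear(attn.q_proj, r=r)
    attn.v_proj = LoRALinear(attn.v_proj, r=r)
```

In a setup like this, only the `lora_A`/`lora_B` tensors receive gradients, which is what keeps the number of trainable parameters small relative to a backbone with over 300 million parameters.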