Authors: Mohammad Reza Zarei, Abbas Akkasi, Majid Komeili
DOI: 10.1007/s44196-025-00853-0
Journal: International Journal of Computational Intelligence Systems, vol. 18, no. 1, p. 109
Published: 2025 (Epub 2025-05-08)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12077310/pdf/
Dual Adapter Tuning of Vision-Language Models Using Large Language Models.
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have shown impressive results in zero-shot vision tasks. The knowledge transferability of these models can be further improved with the help of a limited number of samples. Feature adapter tuning is a prominent approach for efficient transfer learning (ETL). However, most previous ETL models tune either prior-independent or prior-dependent feature adapters. We propose a novel ETL approach that leverages both adapter styles simultaneously. Additionally, most existing ETL models rely on textual prompts constructed by completing generic pre-defined templates. This approach neglects the descriptive knowledge that could assist the VLM through a more informative prompt. Instead of pre-defined templates for prompt construction, we use a pre-trained LLM to generate attribute-specific prompts for each visual category. Furthermore, we guide the VLM with context-aware discriminative information generated by the pre-trained LLM to emphasize features that distinguish the most probable candidate classes. The proposed ETL model is evaluated on 11 datasets and sets a new state of the art. Our code and all collected prompts are publicly available at https://github.com/mrzarei5/DATViL.
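The dual-adapter idea from the abstract can be illustrated with a small sketch: a prior-independent adapter is a learned residual MLP applied to the image feature (in the spirit of CLIP-Adapter), while a prior-dependent adapter is a training-free cache built from the few-shot features and labels (in the spirit of Tip-Adapter), whose logits are blended with the VLM's zero-shot logits against the per-class prompt embeddings. All shapes, names, and hyperparameters below are illustrative stand-ins, not the paper's actual architecture or values; the encoders are replaced by random features.

```python
import numpy as np

rng = np.random.default_rng(0)

D, C, K = 32, 5, 4  # toy sizes: feature dim, classes, shots per class


def l2norm(x):
    # Normalize features to unit length, as CLIP-style models do
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Stand-ins for frozen VLM outputs (random, for illustration only):
text_feats = l2norm(rng.normal(size=(C, D)))      # one prompt embedding per class
cache_keys = l2norm(rng.normal(size=(C * K, D)))  # few-shot image features
cache_vals = np.repeat(np.eye(C), K, axis=0)      # one-hot labels of cached shots

# Prior-independent adapter: a small residual two-layer MLP (weights would be learned)
W1 = rng.normal(scale=0.1, size=(D, D))
W2 = rng.normal(scale=0.1, size=(D, D))


def adapt(f, ratio=0.2):
    h = np.maximum(f @ W1, 0.0) @ W2          # ReLU MLP on the image feature
    return l2norm(ratio * h + (1.0 - ratio) * f)  # residual blend with the original


def classify(img_feat, alpha=1.0, beta=5.5):
    f = adapt(l2norm(img_feat))
    zero_shot = 100.0 * f @ text_feats.T      # CLIP-style zero-shot logits
    # Prior-dependent cache adapter: similarity-weighted vote over few-shot labels
    affinity = np.exp(-beta * (1.0 - f @ cache_keys.T))
    cache_logits = affinity @ cache_vals
    return zero_shot + alpha * cache_logits   # fuse both adapter signals


probe = rng.normal(size=D)
print(classify(probe).shape)  # one logit per class: (5,)
```

The point of the fusion is that the cache term injects task-specific evidence from the few labeled shots without retraining the encoders, while the residual MLP lets a small number of parameters reshape the feature space; the paper's contribution is combining both styles, with the prompt embeddings coming from LLM-generated attribute descriptions rather than fixed templates.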
Journal introduction:
The International Journal of Computational Intelligence Systems publishes original research on all aspects of applied computational intelligence, especially targeting papers demonstrating the use of techniques and methods originating from computational intelligence theory. The core theories of computational intelligence are fuzzy logic, neural networks, evolutionary computation and probabilistic reasoning. The journal publishes only articles related to the use of computational intelligence and broadly covers the following topics:
- Autonomous reasoning
- Bio-informatics
- Cloud computing
- Condition monitoring
- Data science
- Data mining
- Data visualization
- Decision support systems
- Fault diagnosis
- Intelligent information retrieval
- Human-machine interaction and interfaces
- Image processing
- Internet and networks
- Noise analysis
- Pattern recognition
- Prediction systems
- Power (nuclear) safety systems
- Process and system control
- Real-time systems
- Risk analysis and safety-related issues
- Robotics
- Signal and image processing
- IoT and smart environments
- Systems integration
- System control
- System modelling and optimization
- Telecommunications
- Time series prediction
- Warning systems
- Virtual reality
- Web intelligence
- Deep learning