Lv-Adapter: Adapting Vision Transformers for Visual Classification with Linear-layers and Vectors
Guangyi Xu, Junyong Ye, Xinyuan Liu, Xubin Wen, Youwei Li, Jingjing Wang
Journal: Computer Vision and Image Understanding (Impact Factor 4.3; JCR Q2, Computer Science, Artificial Intelligence)
Publication date: 2024-06-07
DOI: 10.1016/j.cviu.2024.104049
URL: https://www.sciencedirect.com/science/article/pii/S1077314224001309
Citations: 0
Abstract
Large pre-trained models based on Vision Transformers (ViTs) can approach billions of parameters and demand substantial computational resources and storage space, which restricts their transferability across tasks. Recent approaches use adapter fine-tuning to address this drawback, but there is still room to reduce the number of tunable parameters and to improve accuracy. To address this challenge, we propose an adapter fine-tuning module called Lv-Adapter, which consists of a linear layer and a vector. The module enables targeted parameter fine-tuning of pre-trained models by learning both the prior knowledge of the pre-training task and the information specific to the downstream task, so that the model adapts to a variety of downstream image and video tasks during transfer learning. Compared to full fine-tuning, Lv-Adapter has several appealing advantages. First, by adding only about 3% extra parameters to ViT, Lv-Adapter achieves accuracy comparable to full fine-tuning and even significantly surpasses it on action recognition benchmarks. Second, Lv-Adapter is a lightweight, plug-and-play module that can be inserted into different transformer models thanks to its simplicity. Finally, extensive experiments on five image and video datasets provide evidence for the effectiveness of Lv-Adapter: when only the extra 3.5% of parameters are updated, it achieves relative boosts of about 13% and 24% over the fully fine-tuned model on SSv2 and HMDB51, respectively.
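The "about 3% extra parameters" figure is a ratio of adapter parameters to backbone parameters. The sketch below shows how such a ratio could be computed for an adapter made of one linear layer plus one vector. The abstract does not give layer sizes or the placement of the module, so everything numeric here is an assumption for illustration only: the function names, the adapter width of 256, one adapter per block, and the rounded backbone size of 86 million parameters (roughly ViT-B/16). It is not the authors' configuration.

```python
# Hypothetical parameter-budget sketch; layer sizes are NOT taken from the paper.


def lv_adapter_params(d_in: int, d_out: int, use_bias: bool = True) -> int:
    """Trainable parameters of one assumed adapter: a linear layer plus a vector."""
    linear = d_in * d_out + (d_out if use_bias else 0)
    vector = d_out  # one learnable entry per output feature (assumed shape)
    return linear + vector


def extra_parameter_fraction(backbone_params: int, per_block: int, num_blocks: int) -> float:
    """Adapter parameters summed over all blocks, relative to the frozen backbone."""
    return per_block * num_blocks / backbone_params


BACKBONE_PARAMS = 86_000_000  # assumed: approximate size of ViT-B/16
NUM_BLOCKS = 12               # assumed: transformer blocks in ViT-B
HIDDEN_DIM = 768              # assumed: token embedding width of ViT-B
ADAPTER_WIDTH = 256           # hypothetical output width of the adapter's linear layer

per_block = lv_adapter_params(HIDDEN_DIM, ADAPTER_WIDTH)
fraction = extra_parameter_fraction(BACKBONE_PARAMS, per_block, NUM_BLOCKS)

print(f"adapter parameters per block: {per_block:,}")
print(f"extra parameters relative to backbone: {fraction:.2%}")
```

Changing `ADAPTER_WIDTH` or the number of blocks that receive an adapter shifts the ratio directly, which is the trade-off between tunable-parameter count and accuracy that the abstract refers to.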
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems