{"title":"CPAL: Cross-Prompting Adapter With LoRAs for RGB+X Semantic Segmentation","authors":"Ye Liu;Pengfei Wu;Miaohui Wang;Jun Liu","doi":"10.1109/TCSVT.2025.3536086","DOIUrl":null,"url":null,"abstract":"As sensor technology evolves, RGB+X systems combine traditional RGB cameras with another type of auxiliary sensor, which enhances perception capabilities and provides richer information for important tasks such as semantic segmentation. However, acquiring massive RGB+X data is difficult due to the need for specific acquisition equipment. Therefore, traditional RGB+X segmentation methods often perform pretraining on relatively abundant RGB data. However, these methods lack corresponding mechanisms to fully exploit the pretrained model, and the scope of the pretraining RGB dataset remains limited. Recent works have employed prompt learning to tap into the potential of pretrained foundation models, but these methods adopt a unidirectional prompting approach i.e., using X or RGB+X modality to prompt pretrained foundation models in RGB modality, neglecting the potential in non-RGB modalities. In this paper, we are dedicated to developing the potential of pretrained foundation models in both RGB and non-RGB modalities simultaneously, which is non-trivial due to the semantic gap between modalities. Specifically, we present the CPAL (Cross-prompting Adapter with LoRAs), a framework that features a novel bi-directional adapter to simultaneously fully exploit the complementarity and bridging the semantic gap between modalities. Additionally, CPAL introduces low-rank adaption (LoRA) to fine-tune the foundation model of each modal. With the support of these elements, we have successfully unleashed the potential of RGB foundation models in both RGB and non-RGB modalities simultaneously. Our method achieves state-of-the-art (SOTA) performance on five multi-modal benchmarks, including RGB+Depth, RGB+Thermal, RGB+Event, and a multi-modal video object segmentation benchmark, as well as four multi-modal salient object detection benchmarks. The code and results are available at: <uri>https://github.com/abelny56/CPAL</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5858-5871"},"PeriodicalIF":11.1000,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10857375/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
As sensor technology evolves, RGB+X systems combine traditional RGB cameras with an additional auxiliary sensor, enhancing perception capabilities and providing richer information for important tasks such as semantic segmentation. However, acquiring large-scale RGB+X data is difficult because it requires specialized acquisition equipment, so traditional RGB+X segmentation methods often pretrain on the relatively abundant RGB data. These methods, however, lack mechanisms to fully exploit the pretrained model, and the scope of the RGB pretraining dataset remains limited. Recent works have employed prompt learning to tap the potential of pretrained foundation models, but they adopt a unidirectional prompting approach, i.e., using the X or RGB+X modality to prompt a foundation model pretrained on the RGB modality, neglecting the potential of non-RGB modalities. In this paper, we aim to unlock the potential of pretrained foundation models in both RGB and non-RGB modalities simultaneously, which is non-trivial due to the semantic gap between modalities. Specifically, we present CPAL (Cross-Prompting Adapter with LoRAs), a framework built around a novel bi-directional adapter that simultaneously exploits the complementarity between modalities and bridges the semantic gap between them. In addition, CPAL introduces low-rank adaptation (LoRA) to fine-tune the foundation model of each modality. With these components, we unleash the potential of RGB foundation models in both RGB and non-RGB modalities simultaneously. Our method achieves state-of-the-art (SOTA) performance on five multi-modal benchmarks, including RGB+Depth, RGB+Thermal, RGB+Event, and a multi-modal video object segmentation benchmark, as well as on four multi-modal salient object detection benchmarks. The code and results are available at: https://github.com/abelny56/CPAL.
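
The sketch below is a minimal, illustrative rendering of the two ideas named in the abstract, not the authors' implementation (see the linked repository for that): a LoRA-style low-rank update added to a frozen pretrained linear layer, and a bi-directional cross-prompting adapter in which the X-modality tokens prompt the RGB branch while the RGB tokens prompt the X branch. All module names, dimensions, and the cross-attention design here are assumptions made for illustration.

```python
# Illustrative sketch only; module names, dimensions, and the cross-attention
# design are assumptions, not the CPAL implementation.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # low-rank branch starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class BiDirectionalCrossPromptAdapter(nn.Module):
    """Exchange prompts between the RGB and X token streams in both directions."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.rgb_from_x = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.x_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_x = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, x_tokens: torch.Tensor):
        # X tokens prompt the RGB branch ...
        rgb_prompted, _ = self.rgb_from_x(self.norm_rgb(rgb_tokens), x_tokens, x_tokens)
        # ... and RGB tokens prompt the X branch, making the prompting bi-directional.
        x_prompted, _ = self.x_from_rgb(self.norm_x(x_tokens), rgb_tokens, rgb_tokens)
        return rgb_tokens + rgb_prompted, x_tokens + x_prompted


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 256)   # RGB tokens (batch, tokens, dim)
    aux = torch.randn(2, 196, 256)   # X-modality tokens (e.g., depth or thermal)
    adapter = BiDirectionalCrossPromptAdapter(dim=256)
    rgb_out, aux_out = adapter(rgb, aux)
    proj = LoRALinear(256, 256, rank=8)
    print(rgb_out.shape, aux_out.shape, proj(rgb_out).shape)
```

In this toy setup, only the adapter and the low-rank branches would be trained, while the frozen base weights stand in for the pretrained foundation model of each modality.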
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.