SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
arXiv:2409.08083 (arXiv - CS - Computer Vision and Pattern Recognition, 2024-09-12)
Abstract
Foundation models such as ChatGPT and Sora, trained on enormous amounts of data, have had a revolutionary social impact. However, sensors in many other fields cannot collect data at a scale comparable to natural images, which makes it extremely challenging to train equally strong foundation models for them. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support each of the evaluated new image modalities. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance the performance of other sensors. Specifically, SimMAT improves the average segmentation performance (mIoU) across the evaluated modalities from 22.15% to 53.88% and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and help various fields achieve better results with vision foundation models.
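
To make the framework description concrete, below is a minimal PyTorch sketch of the idea summarized in the abstract: a modality-agnostic transfer layer maps an input with an arbitrary number of channels (for example, a polarization image) into a representation that an RGB-pretrained foundation model such as SAM can consume. The class names, layer sizes, and the choice to project into a 3-channel pseudo-RGB space are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch of a SimMAT-style setup (illustrative, not the authors' code):
# a modality-agnostic transfer layer (MAT) in front of an RGB-pretrained model.

import torch
import torch.nn as nn


class ModalityAgnosticTransfer(nn.Module):
    """Maps a C-channel modality (e.g., 4-channel polarization) to a
    3-channel pseudo-RGB tensor that an RGB-pretrained backbone accepts."""

    def __init__(self, in_channels: int, hidden_channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden_channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (B, C, H, W) -> (B, 3, H, W)


class SimMATStyleModel(nn.Module):
    """MAT followed by a pretrained foundation model (e.g., SAM's image encoder)."""

    def __init__(self, mat: nn.Module, pretrained_backbone: nn.Module):
        super().__init__()
        self.mat = mat                       # trained on the new modality
        self.backbone = pretrained_backbone  # pretrained on natural RGB images

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(self.mat(x))


# Usage with a 4-channel polarization-like input; nn.Identity() stands in for
# the real pretrained backbone so the sketch is runnable on its own.
model = SimMATStyleModel(ModalityAgnosticTransfer(in_channels=4), nn.Identity())
pseudo_rgb = model(torch.randn(1, 4, 256, 256))  # output shape: (1, 3, 256, 256)
```

In practice, the pretrained backbone would be SAM's image encoder (kept frozen or fine-tuned) followed by its mask decoder; the identity module above is only a stand-in so the sketch runs on its own.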