A semantic-enhanced multi-modal remote sensing foundation model for Earth observation

Kang Wu, Yingying Zhang, Lixiang Ru, Bo Dang, Jiangwei Lao, Lei Yu, Junwei Luo, Zifan Zhu, Yue Sun, Jiahao Zhang, Qi Zhu, Jian Wang, Ming Yang, Jingdong Chen, Yongjun Zhang, Yansheng Li

Nature Machine Intelligence 7(8), 1235–1249. Published 4 August 2025. DOI: 10.1038/s42256-025-01078-8. Available at https://www.nature.com/articles/s42256-025-01078-8
Citations: 0
Abstract
Remote sensing foundation models, pretrained on massive remote sensing data, have shown impressive performance on several Earth observation (EO) tasks. These models usually use single-modal temporal data for pretraining, which is insufficient for multi-modal applications. Moreover, they require a considerable number of labelled samples for fine-tuning on downstream tasks, posing challenges in time-sensitive scenarios such as rapid flood mapping. We present SkySense++, a multi-modal remote sensing foundation model for diverse EO tasks. SkySense++ has a factorized architecture that accommodates multi-modal images acquired by diverse sensors. We adopt two-stage progressive pretraining on meticulously curated datasets of 27 million multi-modal remote sensing images. The first stage, representation-enhanced pretraining, uses multi-granularity contrastive learning to obtain general representations. The second stage, semantic-enhanced pretraining, leverages masked semantic learning to learn semantically enriched representations, enabling few-shot capabilities. This ability allows the model to handle unseen tasks with minimal labelled data, alleviating the need for fine-tuning on extensive annotated data. SkySense++ demonstrates consistent improvements in classification, detection and segmentation over previous state-of-the-art models across 12 EO tasks in 7 domains: agriculture, forestry, oceanography, atmosphere, biology, land surveying and disaster management. This generalizability may open a new chapter of remote sensing foundation model applications for EO tasks at scale.

Wu et al. developed SkySense++, a multi-modal remote sensing foundation model pretrained on 27 million multi-modal images, which achieved robust generalization and few-shot capabilities across several Earth observation tasks and domains, including agriculture and disaster management.
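The abstract describes multi-granularity contrastive learning only at a high level. As a rough illustration of what a stage-one objective of this kind can look like, the PyTorch sketch below applies a symmetric InfoNCE loss at two granularities, global (pooled image embeddings) and local (co-located patches), across two modality branches. All function names, the pairing scheme, and the loss weights are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two aligned embedding sets of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_granularity_contrastive_loss(feats_m1, feats_m2, w_global=1.0, w_local=1.0):
    """feats_m1, feats_m2: (B, P, D) patch features from two modality branches
    of a factorized encoder, assumed spatially aligned patch-for-patch."""
    # Global granularity: contrast mean-pooled image embeddings across modalities.
    loss_global = info_nce(feats_m1.mean(dim=1), feats_m2.mean(dim=1))
    # Local granularity: within each image, contrast co-located patches so the
    # negatives are the other patches of the same scene.
    loss_local = torch.stack([info_nce(feats_m1[i], feats_m2[i])
                              for i in range(feats_m1.size(0))]).mean()
    return w_global * loss_global + w_local * loss_local
```

Treating pooled embeddings and individual patches as separate contrastive units is one common way to make representations useful both for scene-level classification and for dense tasks such as segmentation.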
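Similarly, masked semantic learning is named but not specified here. One minimal, BEiT-style reading is masked prediction where the target is a discrete semantic code per patch (for example, an index from a precomputed codebook or clustering) rather than raw pixel values. The class below is a sketch under that assumption; `semantic_codes` is a hypothetical stand-in for whatever semantic targets the method actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSemanticHead(nn.Module):
    """Predicts a per-patch semantic code at masked positions, so the encoder
    is trained to recover semantics rather than pixels."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, patch_feats: torch.Tensor,
                semantic_codes: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # patch_feats:    (B, P, D) encoder outputs, with mask tokens at masked positions
        # semantic_codes: (B, P) int64 target code per patch (assumed precomputed)
        # mask:           (B, P) bool, True where the patch was masked out
        logits = self.classifier(patch_feats[mask])  # (N_masked, vocab_size)
        return F.cross_entropy(logits, semantic_codes[mask])
```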
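The few-shot claim, handling unseen tasks from minimal labelled data without fine-tuning, can be pictured with a nearest-prototype classifier on frozen foundation-model features. This is a generic sketch of that evaluation style, not SkySense++'s documented protocol; `support_feats` and `query_feats` are assumed to come from the frozen encoder.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support_feats: torch.Tensor,
                       support_labels: torch.Tensor,
                       query_feats: torch.Tensor) -> torch.Tensor:
    """Few-shot classification without fine-tuning: build one prototype per
    class from the labelled support set, then assign each query to the
    nearest prototype by cosine similarity."""
    classes = support_labels.unique()  # sorted class ids present in the support set
    protos = torch.stack([support_feats[support_labels == c].mean(dim=0)
                          for c in classes])
    protos = F.normalize(protos, dim=-1)
    query = F.normalize(query_feats, dim=-1)
    return classes[(query @ protos.t()).argmax(dim=-1)]  # predicted class ids
```

With, say, five labelled images per class of a new flood-mapping dataset, this procedure needs no gradient updates at all, which is what makes such few-shot behaviour attractive in time-sensitive scenarios.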
About the journal
Nature Machine Intelligence is a distinguished publication that presents original research and reviews on various topics in machine learning, robotics, and AI. Our focus extends beyond these fields to their profound impact on other scientific disciplines, as well as on societal and industrial aspects. We see limitless possibilities for machine intelligence to augment human capabilities and knowledge in domains such as scientific exploration, healthcare, medical diagnostics, and the creation of safe and sustainable cities, transportation, and agriculture. At the same time, we acknowledge the ethical, social, and legal concerns that emerge from the rapid pace of these advancements.
To foster interdisciplinary discussions on these far-reaching implications, Nature Machine Intelligence serves as a platform for dialogue facilitated through Comments, News Features, News & Views articles, and Correspondence. Our goal is to encourage a comprehensive examination of these subjects.
Similar to all Nature-branded journals, Nature Machine Intelligence operates under the guidance of a team of skilled editors. We adhere to a fair and rigorous peer-review process, ensuring high standards of copy-editing and production, swift publication, and editorial independence.