AllSpark: A Multimodal Spatiotemporal General Intelligence Model With Ten Modalities via Language as a Reference Framework
Run Shao; Cheng Yang; Qiujun Li; Linrui Xu; Xiang Yang; Xian Li; Mengyao Li; Qing Zhu; Yongjun Zhang; Yansheng Li; Yu Liu; Yong Tang; Dapeng Liu; Shizhong Yang; Haifeng Li
IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-20, published online January 7, 2025. DOI: 10.1109/TGRS.2025.3526725. Available at https://ieeexplore.ieee.org/document/10830573/
Abstract
RGB, multispectral, point cloud, and other spatiotemporal modalities are fundamentally different observational approaches to the same geographic object, so leveraging multimodal data is an inherent requirement for comprehending geographic objects. However, owing to the high structural and semantic heterogeneity among spatiotemporal modalities, the joint interpretation of multimodal spatiotemporal data has long been an extremely challenging problem. The primary challenge lies in striking a balance between the cohesion and autonomy of the diverse modalities, a trade-off that becomes progressively harder to manage as the number of modalities grows. Inspired by the human cognitive system and by linguistic philosophy, in which perceptual signals from the five senses converge into language, we introduce the language as reference framework (LaRF), a fundamental principle for constructing a unified multimodal model. Building on this principle, we propose AllSpark, a multimodal spatiotemporal general artificial intelligence model. The model integrates ten modalities into a unified framework: 1-D (language, code, and table), 2-D (RGB, synthetic aperture radar (SAR), multispectral, hyperspectral, graph, and trajectory), and 3-D (point cloud). To achieve modal cohesion, AllSpark introduces a modal bridge and a multimodal large language model (LLM) that map diverse modal features into the language feature space. To maintain modality autonomy, AllSpark uses modality-specific encoders to extract tokens from the various spatiotemporal modalities. Finally, observing a gap between the model's interpretability and the requirements of downstream tasks, we design modality-specific prompts and task heads to enhance the model's generalization across specific tasks. Experiments indicate that incorporating language enables AllSpark to excel at few-shot classification for the RGB and point cloud modalities without additional training, surpassing baseline performance by up to 41.82%. In addition, despite lacking expert knowledge for most spatiotemporal modalities and relying on a unified structure, AllSpark adapts well across all ten modalities. LaRF and AllSpark contribute to shifting the research paradigm of spatiotemporal intelligence from modality-specific, task-specific models toward a general paradigm. The source code is available at https://github.com/GeoX-Lab/AllSpark.
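The abstract describes a three-part composition: modality-specific encoders preserve autonomy, a modal bridge together with a multimodal LLM provides cohesion by mapping every modality into the language feature space, and modality-specific prompts and task heads connect the shared representation to downstream tasks. The following is a minimal PyTorch sketch of that composition only; all class names, dimensions, the DummyPatchEncoder, and the use of a small nn.TransformerEncoder as a stand-in for the multimodal LLM are illustrative assumptions and do not reflect the released GeoX-Lab/AllSpark implementation.

```python
import torch
import torch.nn as nn


class DummyPatchEncoder(nn.Module):
    """Stand-in for a modality-specific encoder (e.g., an RGB or point cloud backbone)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.out_dim = out_dim
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, in_dim) -> (batch, num_tokens, out_dim)
        return self.proj(x)


class ModalBridge(nn.Module):
    """Maps modality-specific token features into the language feature space."""

    def __init__(self, modal_dim: int, lang_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(modal_dim, lang_dim),
            nn.GELU(),
            nn.Linear(lang_dim, lang_dim),
        )

    def forward(self, modal_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(modal_tokens)


class AllSparkSketch(nn.Module):
    """Toy composition: modality encoders -> modal bridges -> LLM stand-in -> task head."""

    def __init__(self, encoders: dict, lang_dim: int, num_classes: int):
        super().__init__()
        # Modality autonomy: one encoder per modality.
        self.encoders = nn.ModuleDict(encoders)
        # Modal cohesion: one bridge per modality into the shared language space.
        self.bridges = nn.ModuleDict(
            {name: ModalBridge(enc.out_dim, lang_dim) for name, enc in encoders.items()}
        )
        # A small Transformer stands in for the multimodal LLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lang_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Task head (classification here); the paper uses modality-specific heads.
        self.head = nn.Linear(lang_dim, num_classes)

    def forward(self, modality: str, x: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        tokens = self.encoders[modality](x)       # (B, N, modal_dim)
        tokens = self.bridges[modality](tokens)   # (B, N, lang_dim)
        seq = torch.cat([prompt, tokens], dim=1)  # prepend modality-specific prompt tokens
        feats = self.llm(seq)                     # joint reasoning in the language space
        return self.head(feats.mean(dim=1))       # pooled class logits


# Usage example: a single hypothetical "rgb" modality with two prompt tokens.
model = AllSparkSketch({"rgb": DummyPatchEncoder(768, 512)}, lang_dim=256, num_classes=10)
logits = model("rgb", torch.randn(4, 16, 768), prompt=torch.randn(4, 2, 256))
print(logits.shape)  # torch.Size([4, 10])
```

The design point illustrated is the separation of concerns: swapping in a new modality only requires registering another encoder-bridge pair, while the shared backbone and the language feature space stay fixed.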
Journal description:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.