AllSpark: A Multimodal Spatiotemporal General Intelligence Model With Ten Modalities via Language as a Reference Framework

Impact Factor: 8.6 | CAS Tier 1 (Earth Science) | JCR Q1 (Engineering, Electrical & Electronic)
Run Shao;Cheng Yang;Qiujun Li;Linrui Xu;Xiang Yang;Xian Li;Mengyao Li;Qing Zhu;Yongjun Zhang;Yansheng Li;Yu Liu;Yong Tang;Dapeng Liu;Shizhong Yang;Haifeng Li
{"title":"AllSpark: A Multimodal Spatiotemporal General Intelligence Model With Ten Modalities via Language as a Reference Framework","authors":"Run Shao;Cheng Yang;Qiujun Li;Linrui Xu;Xiang Yang;Xian Li;Mengyao Li;Qing Zhu;Yongjun Zhang;Yansheng Li;Yu Liu;Yong Tang;Dapeng Liu;Shizhong Yang;Haifeng Li","doi":"10.1109/TGRS.2025.3526725","DOIUrl":null,"url":null,"abstract":"RGB, multispectral, point, and other spatiotemporal modal data fundamentally represent different observational approaches for the same geographic object. Therefore, leveraging multimodal data is an inherent requirement for comprehending geographic objects. However, due to the high heterogeneity in structure and semantics among various spatiotemporal modalities, the joint interpretation of multimodal spatiotemporal data has long been an extremely challenging problem. The primary challenge resides in striking a trade-off between the cohesion and autonomy of diverse modalities. This trade-off becomes progressively nonlinear as the number of modalities expands. Inspired by the human cognitive system and linguistic philosophy, where perceptual signals from the five senses converge into language, we introduce the language as reference framework (LaRF), a fundamental principle for constructing a multimodal unified model. Building upon this, we propose AllSpark, a multimodal spatiotemporal general artificial intelligence model. Our model integrates ten different modalities into a unified framework, including 1-D (language, code, and table), 2-D (RGB, synthetic aperture radar (SAR), multispectral, hyperspectral, graph, and trajectory), and 3-D (point cloud) modalities. To achieve modal cohesion, AllSpark introduces a modal bridge and multimodal large language model (LLM) to map diverse modal features into the language feature space. To maintain modality autonomy, AllSpark uses modality-specific encoders to extract the tokens of various spatiotemporal modalities. Finally, observing a gap between the model’s interpretability and downstream tasks, we designed modality-specific prompts and task heads, enhancing the model’s generalization capability across specific tasks. Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks for RGB and point cloud modalities without additional training, surpassing baseline performance by up to 41.82%. Additionally, AllSpark, despite lacking expert knowledge in most spatiotemporal modalities and utilizing a unified structure, demonstrates strong adaptability across ten modalities. LaRF and AllSpark contribute to the shift in the research paradigm in spatiotemporal intelligence, transitioning from a modality-specific and task-specific paradigm to a general paradigm. 
The source code is available at <uri>https://github.com/GeoX-Lab/AllSpark</uri>.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-20"},"PeriodicalIF":8.6000,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10830573/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

RGB, multispectral, point, and other spatiotemporal modal data fundamentally represent different observational approaches for the same geographic object. Therefore, leveraging multimodal data is an inherent requirement for comprehending geographic objects. However, due to the high heterogeneity in structure and semantics among various spatiotemporal modalities, the joint interpretation of multimodal spatiotemporal data has long been an extremely challenging problem. The primary challenge resides in striking a trade-off between the cohesion and autonomy of diverse modalities. This trade-off becomes progressively nonlinear as the number of modalities expands. Inspired by the human cognitive system and linguistic philosophy, where perceptual signals from the five senses converge into language, we introduce the language as reference framework (LaRF), a fundamental principle for constructing a multimodal unified model. Building upon this, we propose AllSpark, a multimodal spatiotemporal general artificial intelligence model. Our model integrates ten different modalities into a unified framework, including 1-D (language, code, and table), 2-D (RGB, synthetic aperture radar (SAR), multispectral, hyperspectral, graph, and trajectory), and 3-D (point cloud) modalities. To achieve modal cohesion, AllSpark introduces a modal bridge and multimodal large language model (LLM) to map diverse modal features into the language feature space. To maintain modality autonomy, AllSpark uses modality-specific encoders to extract the tokens of various spatiotemporal modalities. Finally, observing a gap between the model’s interpretability and downstream tasks, we designed modality-specific prompts and task heads, enhancing the model’s generalization capability across specific tasks. Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks for RGB and point cloud modalities without additional training, surpassing baseline performance by up to 41.82%. Additionally, AllSpark, despite lacking expert knowledge in most spatiotemporal modalities and utilizing a unified structure, demonstrates strong adaptability across ten modalities. LaRF and AllSpark contribute to the shift in the research paradigm in spatiotemporal intelligence, transitioning from a modality-specific and task-specific paradigm to a general paradigm. The source code is available at https://github.com/GeoX-Lab/AllSpark.
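To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the flow it outlines: a modality-specific encoder produces tokens (modality autonomy), a modal bridge projects them into the language feature space (modal cohesion), a language-space backbone fuses them with modality-specific prompt tokens, and a task head produces predictions. All class names, dimensions, and the bridge/prompt/head designs here are illustrative assumptions rather than the paper's implementation; the released code at https://github.com/GeoX-Lab/AllSpark is the authoritative reference.

```python
import torch
import torch.nn as nn


class ModalBridge(nn.Module):
    """Hypothetical bridge: projects modality-specific tokens into the language feature space."""

    def __init__(self, modal_dim: int, lang_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(modal_dim, lang_dim),
            nn.GELU(),
            nn.Linear(lang_dim, lang_dim),
        )

    def forward(self, modal_tokens: torch.Tensor) -> torch.Tensor:
        # modal_tokens: (batch, num_tokens, modal_dim) from a modality-specific encoder
        return self.proj(modal_tokens)  # (batch, num_tokens, lang_dim)


class AllSparkSketch(nn.Module):
    """Minimal sketch of the overall flow: encoder -> bridge -> LLM backbone -> task head.
    The encoder, backbone, and head below are stand-ins, not the released model."""

    def __init__(self, encoder: nn.Module, modal_dim: int, lang_dim: int,
                 num_classes: int, prompt_len: int = 8):
        super().__init__()
        self.encoder = encoder  # modality-specific encoder keeps modality autonomy
        self.bridge = ModalBridge(modal_dim, lang_dim)
        # Learnable modality-specific prompt tokens prepended to the bridged features.
        self.prompt = nn.Parameter(torch.randn(1, prompt_len, lang_dim) * 0.02)
        # Stand-in for the multimodal LLM backbone operating in the language space.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lang_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.task_head = nn.Linear(lang_dim, num_classes)

    def forward(self, modal_input: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(modal_input)        # (B, N, modal_dim)
        lang_tokens = self.bridge(tokens)         # map into language space (cohesion)
        prompt = self.prompt.expand(lang_tokens.size(0), -1, -1)
        fused = self.llm(torch.cat([prompt, lang_tokens], dim=1))
        return self.task_head(fused.mean(dim=1))  # pooled features -> task logits


if __name__ == "__main__":
    # Toy usage with a placeholder encoder for an already-tokenized 2-D modality (e.g., RGB).
    toy_encoder = nn.Linear(768, 768)
    model = AllSparkSketch(toy_encoder, modal_dim=768, lang_dim=512, num_classes=10)
    logits = model(torch.randn(2, 196, 768))      # 2 samples, 196 tokens each
    print(logits.shape)                           # torch.Size([2, 10])
```

The sketch only illustrates the division of labor the abstract describes: swapping the encoder (and its `modal_dim`) per modality while reusing the same bridge-plus-backbone structure is what allows a single model to cover the ten modalities.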
Source journal: IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology - Geochemistry & Geophysics)
CiteScore: 11.50
Self-citation rate: 28.00%
Articles per year: 1912
Average review time: 4.0 months
Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space, and the processing, interpretation, and dissemination of this information.