{"title":"Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts","authors":"Yunxin Li;Shenyuan Jiang;Baotian Hu;Longyue Wang;Wanqi Zhong;Wenhan Luo;Lin Ma;Min Zhang","doi":"10.1109/TPAMI.2025.3532688","DOIUrl":null,"url":null,"abstract":"Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to scale large language or visual-language models efficiently, these efforts typically involve fewer experts and limited modalities. To address this, our work presents the pioneering attempt to develop a unified MLLM with the MoE architecture, named <bold>Uni-MoE</b> that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance the multi-expert collaboration and generalization, we present a progressive training strategy: 1) Cross-modality alignment using various connectors with different cross-modality data, 2) Training modality-specific experts with cross-modality instruction data to activate experts’ preferences, and 3) Tuning the whole Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3424-3439"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10887014/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to scale large language or visual-language models efficiently, these efforts typically involve few experts and limited modalities. To address this, our work presents the pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE, which can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance multi-expert collaboration and generalization, we present a progressive training strategy: 1) Cross-modality alignment using various connectors with different cross-modality data, 2) Training modality-specific experts with cross-modality instruction data to activate experts' preferences, and 3) Tuning the whole Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. Extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization.
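To make the sparse MoE idea concrete, the sketch below shows a top-k routed feed-forward layer in PyTorch: a router scores the experts per token, and each token is processed only by its selected experts. This is a minimal illustration of the general technique, not the paper's implementation; the hidden sizes, number of experts, and top-k value are illustrative assumptions.

```python
# Minimal sketch of a sparse MoE feed-forward layer with top-k routing.
# Hyperparameters and module names are illustrative assumptions, not the
# configuration used in Uni-MoE.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        logits = self.router(x)                           # (B, T, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Sparse activation: each token is sent only to its selected experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer()
    tokens = torch.randn(2, 16, 512)                      # dummy multimodal token sequence
    print(layer(tokens).shape)                            # torch.Size([2, 16, 512])
```

In a unified multimodal setup, the token sequence fed to such a layer would come from modality-specific encoders whose outputs are mapped into the shared representation space by connectors, so the router can specialize experts toward particular modalities.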
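The final training stage tunes the framework with Low-Rank Adaptation (LoRA). The sketch below shows the generic LoRA construction of wrapping a frozen linear layer with a trainable low-rank update; the rank, scaling factor, and class names are assumptions for illustration, not the paper's exact settings.

```python
# Minimal sketch of a LoRA-wrapped linear layer: the pretrained weight is
# frozen and only the low-rank update (A, B) is trained.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)             # update starts at zero
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the trainable low-rank correction.
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(512, 512))
    print(layer(torch.randn(4, 512)).shape)            # torch.Size([4, 512])
```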