{"title":"解决多模态大语言模型训练中的模型和数据异质性问题","authors":"Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Xin Jin","doi":"arxiv-2408.04275","DOIUrl":null,"url":null,"abstract":"Multimodal large language models (LLMs) have demonstrated significant\npotential in a wide range of AI applications. Yet, training multimodal LLMs\nsuffers from low efficiency and scalability, due to the inherent model\nheterogeneity and data heterogeneity across different modalities. We present MMScale, an efficient and adaptive framework to reform the\ntraining of multimodal large language models on large-scale clusters. MMScale\nexploits the system characteristics of multimodal LLM training to achieve high\nefficiency and scalability. The core of MMScale is the adaptive resource\nallocation and data-aware reordering techniques to eliminate the model and data\nheterogeneity respectively. We also tailor system optimizations for multimodal\nLLM training to offload certain operations from the GPU training. We evaluate\nMMScale across different sizes of multimodal LLMs on a large-scale production\ncluster with thousands of GPUs. The experimental results show that MMScale\nachieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM\non 1172 GPUs and outperforms Megatron-LM by up to 2.2$\\times$ on throughput.\nThe ablation study shows the main techniques of MMScale are both effective and\nlightweight.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"119 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training\",\"authors\":\"Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Xin Jin\",\"doi\":\"arxiv-2408.04275\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal large language models (LLMs) have demonstrated significant\\npotential in a wide range of AI applications. Yet, training multimodal LLMs\\nsuffers from low efficiency and scalability, due to the inherent model\\nheterogeneity and data heterogeneity across different modalities. We present MMScale, an efficient and adaptive framework to reform the\\ntraining of multimodal large language models on large-scale clusters. MMScale\\nexploits the system characteristics of multimodal LLM training to achieve high\\nefficiency and scalability. The core of MMScale is the adaptive resource\\nallocation and data-aware reordering techniques to eliminate the model and data\\nheterogeneity respectively. We also tailor system optimizations for multimodal\\nLLM training to offload certain operations from the GPU training. We evaluate\\nMMScale across different sizes of multimodal LLMs on a large-scale production\\ncluster with thousands of GPUs. 
The experimental results show that MMScale\\nachieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM\\non 1172 GPUs and outperforms Megatron-LM by up to 2.2$\\\\times$ on throughput.\\nThe ablation study shows the main techniques of MMScale are both effective and\\nlightweight.\",\"PeriodicalId\":501422,\"journal\":{\"name\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"volume\":\"119 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Distributed, Parallel, and Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.04275\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04275","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training
Multimodal large language models (LLMs) have demonstrated significant potential in a wide range of AI applications. Yet, training multimodal LLMs suffers from low efficiency and poor scalability due to the inherent model heterogeneity and data heterogeneity across modalities. We present MMScale, an efficient and adaptive framework for training multimodal LLMs on large-scale clusters. MMScale exploits the system characteristics of multimodal LLM training to achieve high efficiency and scalability. At its core are adaptive resource allocation and data-aware reordering, which address model heterogeneity and data heterogeneity, respectively. We also tailor system optimizations for multimodal LLM training that offload certain operations from the GPUs.
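The abstract does not detail the data-aware reordering policy, but the underlying problem is that multimodal samples vary widely in compute cost (an image-heavy sample contributes far more tokens than a short text-only one), so naive batching leaves some workers waiting on the slowest sample. Below is a minimal illustrative sketch, assuming a greedy longest-processing-time balancer over estimated per-sample token costs; the function names and the policy itself are assumptions for illustration, not MMScale's actual algorithm.

```python
# Hypothetical sketch of data-aware reordering: greedily assign variable-cost
# multimodal samples to ranks so per-rank compute load is roughly even. The
# policy is an illustrative assumption, not the algorithm described in the paper.
from typing import List, Tuple
import heapq

def balance_samples(sample_costs: List[int], num_ranks: int) -> List[List[int]]:
    """Assign sample indices to ranks so that per-rank total cost is balanced.

    sample_costs[i] is an estimated compute cost for sample i, e.g. the number
    of text + vision tokens it contributes to the forward pass.
    """
    # Min-heap of (accumulated_cost, rank_id); the least-loaded rank gets the next sample.
    heap: List[Tuple[int, int]] = [(0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_ranks)]

    # Place the most expensive samples first (longest-processing-time heuristic).
    for idx in sorted(range(len(sample_costs)), key=lambda i: -sample_costs[i]):
        cost, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (cost + sample_costs[idx], rank))
    return assignment

# Example: 8 samples with uneven token counts, balanced across 2 ranks.
if __name__ == "__main__":
    costs = [4096, 512, 2048, 1024, 3072, 256, 1536, 768]
    print(balance_samples(costs, num_ranks=2))
```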
We evaluate MMScale on multimodal LLMs of different sizes on a large-scale production cluster with thousands of GPUs. The experimental results show that MMScale achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2x in throughput. The ablation study shows that the main techniques of MMScale are both effective and lightweight.
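For context, Model FLOPs Utilization is commonly defined as the model FLOP/s actually achieved divided by the cluster's aggregate peak FLOP/s, with training FLOPs approximated as 6 x parameters x tokens. The sketch below inverts that generic definition to see what token throughput the reported 54.7% would roughly imply; the per-GPU peak figure is an assumption (approximately BF16 dense peak on H800-class accelerators), not something the abstract states.

```python
# Back-of-the-envelope around the reported result: MFU = achieved model FLOP/s
# divided by aggregate peak hardware FLOP/s, with training FLOPs approximated
# as 6 * parameters * tokens. The per-GPU peak below is an assumed value.

def implied_tokens_per_second(mfu: float, params: float,
                              num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Invert MFU = 6 * params * tokens_per_s / (num_gpus * peak) for tokens_per_s."""
    return mfu * num_gpus * peak_flops_per_gpu / (6.0 * params)

if __name__ == "__main__":
    tps = implied_tokens_per_second(mfu=0.547, params=72e9,
                                    num_gpus=1172, peak_flops_per_gpu=989e12)
    print(f"implied throughput ~ {tps:.2e} tokens/s")  # roughly 1.5e6 under these assumptions
```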