{"title":"Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation","authors":"Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu","doi":"arxiv-2408.03505","DOIUrl":null,"url":null,"abstract":"Multimodal large language models (MLLMs) have extended the success of large\nlanguage models (LLMs) to multiple data types, such as image, text and audio,\nachieving significant performance in various domains, including multimodal\ntranslation, visual question answering and content generation. Nonetheless,\nexisting systems are inefficient to train MLLMs due to substantial GPU bubbles\ncaused by the heterogeneous modality models and complex data dependencies in 3D\nparallelism. This paper proposes Optimus, a distributed MLLM training system\nthat reduces end-to-end MLLM training time. Optimus is based on our principled\nanalysis that scheduling the encoder computation within the LLM bubbles can\nreduce bubbles in MLLM training. To make scheduling encoder computation\npossible for all GPUs, Optimus searches the separate parallel plans for encoder\nand LLM, and adopts a bubble scheduling algorithm to enable exploiting LLM\nbubbles without breaking the original data dependencies in the MLLM model\narchitecture. We further decompose encoder layer computation into a series of\nkernels, and analyze the common bubble pattern of 3D parallelism to carefully\noptimize the sub-millisecond bubble scheduling, minimizing the overall training\ntime. Our experiments in a production cluster show that Optimus accelerates\nMLLM training by 20.5%-21.3% with ViT-22B and GPT-175B model over 3072 GPUs\ncompared to baselines.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"13 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Distributed, Parallel, and Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03505","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text, and audio, achieving strong performance in domains including multimodal translation, visual question answering, and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and the complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible on all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble scheduling algorithm that exploits LLM bubbles without breaking the data dependencies of the MLLM architecture. We further decompose encoder layer computation into a series of kernels and analyze the common bubble patterns of 3D parallelism to carefully optimize sub-millisecond bubble scheduling, minimizing overall training time. Our experiments on a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with the ViT-22B and GPT-175B models on 3072 GPUs compared to baselines.
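
To make the idea of bubble exploitation concrete, below is a minimal, illustrative sketch (not the paper's algorithm) of greedily packing encoder kernels, in dependency order, into idle pipeline windows. The types Bubble and Kernel, the function schedule_kernels_into_bubbles, and the greedy policy are assumptions introduced here for illustration only.

```python
# Illustrative sketch only: greedily pack ordered encoder kernels into idle
# "bubble" windows of an LLM pipeline schedule. The data types and policy are
# assumptions for illustration, not the scheduler described in the paper.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Bubble:
    start_ms: float      # when the GPU becomes idle in the pipeline schedule
    duration_ms: float   # length of the idle window

@dataclass
class Kernel:
    name: str
    duration_ms: float   # estimated (sub-millisecond) kernel runtime

def schedule_kernels_into_bubbles(
    bubbles: List[Bubble], kernels: List[Kernel]
) -> Tuple[List[Tuple[str, int]], List[Kernel]]:
    """Greedily place kernels (kept in dependency order) into bubbles.

    Returns (placements, leftovers): placements pairs each kernel name with
    the index of the bubble it was assigned to; leftovers are kernels that
    did not fit and would run outside the bubbles.
    """
    placements: List[Tuple[str, int]] = []
    leftovers: List[Kernel] = []
    remaining = [b.duration_ms for b in bubbles]
    bubble_idx = 0
    for kernel in kernels:  # kernels must execute in the given order
        # advance to the first bubble with enough idle time left
        while bubble_idx < len(bubbles) and remaining[bubble_idx] < kernel.duration_ms:
            bubble_idx += 1
        if bubble_idx == len(bubbles):
            # no bubble can absorb this kernel; it (and, because bubble_idx
            # never resets, all of its successors) runs outside the bubbles
            leftovers.append(kernel)
            continue
        remaining[bubble_idx] -= kernel.duration_ms
        placements.append((kernel.name, bubble_idx))
    return placements, leftovers

if __name__ == "__main__":
    bubbles = [Bubble(start_ms=0.0, duration_ms=1.2), Bubble(start_ms=5.0, duration_ms=0.8)]
    kernels = [Kernel("vit_layer0_qkv", 0.4), Kernel("vit_layer0_attn", 0.5),
               Kernel("vit_layer0_mlp", 0.6), Kernel("vit_layer1_qkv", 0.4)]
    placed, left = schedule_kernels_into_bubbles(bubbles, kernels)
    print("placed:", placed)
    print("unscheduled:", [k.name for k in left])
```

In this toy model a kernel is placed only if it fits entirely inside a window, and once a kernel cannot be placed its successors also fall back outside the bubbles, so execution order is preserved. Per the abstract, the paper's scheduler goes further: it operates on separate parallel plans for the encoder and the LLM and tunes sub-millisecond placement against the common bubble patterns of 3D parallelism.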