arXiv - CS - Distributed, Parallel, and Cluster Computing: Latest Publications

Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning
Pub Date: 2024-08-08 · arXiv:2408.04323
Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang
Abstract: A service-level objective (SLO) is a target performance metric that cloud vendors aim to ensure for a service. Delivering optimized SLOs can enhance user satisfaction and improve the competitiveness of cloud vendors. As large language models (LLMs) gain popularity across various fields, it is of great significance to optimize SLOs for LLM inference services. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and that the optimal parameter configurations differ across services. We therefore propose SCOOT, an automatic performance tuning system that optimizes SLOs for each LLM inference service by tuning the parameters of the inference engine. We first propose a generalized formulation of the tuning problem to handle various objectives and constraints between parameters, and SCOOT exploits Bayesian optimization (BO) to solve the problem via exploration and exploitation. Moreover, SCOOT adopts a random forest to learn hidden constraints during the tuning process, mitigating invalid exploration. To improve tuning efficiency, SCOOT uses parallel suggestions to accelerate the tuning process. Extensive experiments demonstrate that SCOOT significantly outperforms existing tuning techniques in SLO optimization while greatly improving tuning efficiency.

Citations: 0
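The explore-then-exploit tuning loop the abstract describes can be sketched in miniature. The sketch below substitutes plain random search for full Bayesian optimization, and the parameter names and cost model are hypothetical illustrations, not taken from the paper or any real inference engine.

```python
import random

# Hypothetical inference-engine parameter space (names are illustrative only).
PARAM_SPACE = {
    "max_batch_size": [8, 16, 32, 64],
    "num_scheduler_steps": [1, 2, 4],
    "gpu_memory_fraction": [0.7, 0.8, 0.9],
}

def sample_config(rng):
    """Draw one candidate configuration uniformly at random (exploration)."""
    return {k: rng.choice(v) for k, v in PARAM_SPACE.items()}

def measure_slo_violation(config):
    """Stand-in for benchmarking the engine under `config`. A real system
    would deploy the configuration and measure latency/throughput against
    the SLOs; here a synthetic penalty plays that role."""
    penalty = 0.0
    penalty += abs(config["max_batch_size"] - 32) / 32
    penalty += abs(config["gpu_memory_fraction"] - 0.9)
    penalty += (config["num_scheduler_steps"] - 2) ** 2 * 0.1
    return penalty

def tune(budget=50, seed=0):
    """Black-box tuning loop: evaluate candidates, keep the best seen."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(budget):
        cfg = sample_config(rng)
        score = measure_slo_violation(cfg)
        if score < best_score:  # exploitation: remember the best config
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

A BO-based tuner such as SCOOT would replace the uniform sampler with a surrogate model that proposes promising configurations, but the outer loop has the same shape.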
Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training
Pub Date: 2024-08-08 · arXiv:2408.04275
Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Xin Jin
Abstract: Multimodal large language models (LLMs) have demonstrated significant potential in a wide range of AI applications. Yet, training multimodal LLMs suffers from low efficiency and poor scalability, due to the inherent model heterogeneity and data heterogeneity across different modalities. We present MMScale, an efficient and adaptive framework to reform the training of multimodal large language models on large-scale clusters. MMScale exploits the system characteristics of multimodal LLM training to achieve high efficiency and scalability. The core of MMScale is its adaptive resource allocation and data-aware reordering techniques, which eliminate the model and data heterogeneity respectively. We also tailor system optimizations for multimodal LLM training to offload certain operations from GPU training. We evaluate MMScale across different sizes of multimodal LLMs on a large-scale production cluster with thousands of GPUs. The experimental results show that MMScale achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2x on throughput. The ablation study shows the main techniques of MMScale are both effective and lightweight.

Citations: 0

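As a toy illustration of what data-aware reordering can mean in practice, the sketch below balances per-sample costs (e.g., token or frame counts, which vary across modalities) over workers using a generic longest-processing-time-first heuristic; this is an assumption-laden stand-in, not MMScale's actual technique.

```python
import heapq

def balance_microbatches(sample_costs, num_workers):
    """Greedy LPT assignment: visit samples from most to least expensive and
    place each on the currently least-loaded worker, so no worker becomes a
    straggler. A simplified stand-in for data-aware reordering across
    heterogeneous modalities, not the algorithm from the paper."""
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    for idx in sorted(range(len(sample_costs)), key=lambda i: -sample_costs[i]):
        load, w = heapq.heappop(heap)        # least-loaded worker so far
        assignment[w].append(idx)
        heapq.heappush(heap, (load + sample_costs[idx], w))
    return assignment
```

With costs `[5, 4, 3, 3, 2, 1]` and two workers, both workers end up with a total load of 9, instead of the 12-vs-6 split that naive contiguous partitioning would produce.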
PeerSwap: A Peer-Sampler with Randomness Guarantees
Pub Date: 2024-08-07 · arXiv:2408.03829
Rachid Guerraoui, Anne-Marie Kermarrec, Anastasiia Kucherenko, Rafael Pinot, Marijn de Vos
Abstract: The ability of a peer-to-peer (P2P) system to effectively host decentralized applications often relies on the availability of a peer-sampling service, which provides each participant with a random sample of other peers. Despite the practical effectiveness of existing peer samplers, their ability to produce random samples within a reasonable time frame remains poorly understood from a theoretical standpoint. This paper contributes to bridging this gap by introducing PeerSwap, a peer-sampling protocol with provable randomness guarantees. We establish execution-time bounds for PeerSwap, demonstrating its ability to scale effectively with the network size. We prove that PeerSwap maintains the fixed structure of the communication graph while allowing sequential peer position swaps within this graph. We do so by showing that PeerSwap is a specific instance of an interchange process, a renowned model for particle-movement analysis. Leveraging this mapping, we derive execution-time bounds expressed as a function of the network size N. Depending on the network structure, this time can be as low as a polylogarithmic function of N, highlighting the efficiency of PeerSwap. We implement PeerSwap and conduct numerical evaluations using regular graphs with varying connectivity and containing up to 32,768 (2^15) peers. Our evaluation demonstrates that PeerSwap quickly provides peers with uniform random samples of other peers.

Citations: 0

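The interchange-process view can be illustrated with a few lines of simulation: peers occupy the vertices of a fixed graph, and at each step the occupants of a randomly chosen edge swap places. This is a generic simulation of an interchange process under assumed parameters, not PeerSwap's protocol logic.

```python
import random

def interchange_simulate(edges, num_peers, steps, seed=0):
    """Toy interchange process: the communication graph never changes; only
    which peer sits at which vertex does, mirroring the sequential peer
    position swaps that PeerSwap performs."""
    rng = random.Random(seed)
    position = list(range(num_peers))  # position[v] = peer currently at vertex v
    for _ in range(steps):
        u, v = rng.choice(edges)       # pick a random edge of the fixed graph
        position[u], position[v] = position[v], position[u]
    return position

# Example: a 4-cycle with each peer initially at its own vertex.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
final = interchange_simulate(ring, 4, steps=100)
```

After any number of swaps the multiset of peers is preserved; only their placement is shuffled, which is what allows the paper to reason about mixing time rather than graph maintenance.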
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Pub Date: 2024-08-07 · arXiv:2408.03505
Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu
Abstract: Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text and audio, achieving significant performance in various domains, including multimodal translation, visual question answering and content generation. Nonetheless, existing systems are inefficient at training MLLMs due to substantial GPU bubbles caused by the heterogeneous modality models and complex data dependencies in 3D parallelism. This paper proposes Optimus, a distributed MLLM training system that reduces end-to-end MLLM training time. Optimus is based on our principled analysis that scheduling the encoder computation within the LLM bubbles can reduce bubbles in MLLM training. To make scheduling encoder computation possible for all GPUs, Optimus searches for separate parallel plans for the encoder and the LLM, and adopts a bubble-scheduling algorithm that exploits LLM bubbles without breaking the original data dependencies in the MLLM model architecture. We further decompose encoder-layer computation into a series of kernels, and analyze the common bubble pattern of 3D parallelism to carefully optimize the sub-millisecond bubble scheduling, minimizing overall training time. Our experiments in a production cluster show that Optimus accelerates MLLM training by 20.5%-21.3% with ViT-22B and GPT-175B models over 3072 GPUs compared to baselines.

Citations: 0

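The core idea of placing encoder kernels into idle pipeline time can be sketched as a simple first-fit packing problem. The sketch below is a deliberately simplified illustration under assumed inputs (free-time budgets per bubble, kernel durations); it is not Optimus's bubble-scheduling algorithm, which must also respect data dependencies and parallel plans.

```python
def fill_bubbles(bubbles, kernels):
    """First-fit packing of encoder kernels into pipeline bubbles.
    `bubbles` are free-time budgets (ms) in the LLM pipeline schedule;
    `kernels` are encoder kernel durations (ms)."""
    remaining = list(bubbles)        # unused time left in each bubble
    placement = []                   # (kernel_index, bubble_index) pairs
    for k, dur in enumerate(kernels):
        for b, free in enumerate(remaining):
            if dur <= free:          # kernel fits in this bubble's slack
                remaining[b] -= dur
                placement.append((k, b))
                break                # kernel placed; move to the next one
    return placement, remaining
```

Decomposing encoder layers into small kernels, as the paper describes, matters precisely because finer-grained work items pack into sub-millisecond bubbles with less wasted slack.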
A Blockchain-based Reliable Federated Meta-learning for Metaverse: A Dual Game Framework
Pub Date: 2024-08-07 · arXiv:2408.03694
Emna Baccour, Aiman Erbad, Amr Mohamed, Mounir Hamdi, Mohsen Guizani
Abstract: The metaverse, envisioned as the next digital frontier for avatar-based virtual interaction, involves high-performance models. In this dynamic environment, users' tasks frequently shift, requiring fast model personalization despite limited data. This evolution consumes extensive resources and requires vast data volumes. To address this, meta-learning emerges as an invaluable tool for metaverse users, with federated meta-learning (FML) offering even more tailored solutions owing to its adaptive capabilities. However, the metaverse is characterized by user heterogeneity, with diverse data structures, varied tasks, and uneven sample sizes, potentially undermining global training outcomes due to statistical differences. Given this, an urgent need arises for smart coalition formation that accounts for these disparities. This paper introduces a dual game-theoretic framework for metaverse services involving meta-learners as workers to manage FML. A blockchain-based cooperative coalition formation game is crafted, grounded in a reputation metric, user similarity, and incentives. We also introduce a novel reputation system based on users' historical contributions and potential contributions to present tasks, leveraging correlations between past and new tasks. Finally, a Stackelberg game-based incentive mechanism is presented to attract reliable workers to participate in meta-learning, minimizing users' energy costs, increasing payoffs, boosting FML efficacy, and improving metaverse utility. Results show that our dual game framework outperforms best-effort, random, and non-uniform clustering schemes, improving training performance by up to 10%, cutting completion times by as much as 30%, enhancing metaverse utility by more than 25%, and offering up to a 5% boost in training efficiency over non-blockchain systems, effectively countering misbehaving users.

Citations: 0

The State of FaaS: An analysis of public Functions-as-a-Service providers
Pub Date: 2024-08-06 · arXiv:2408.03021
Nnamdi Ekwe-Ekwe, Lucas Amos
Abstract: Serverless computing is a growing and maturing field that is the focus of much research, industry interest and adoption. Previous work exploring Functions-as-a-Service providers has focused primarily on the most well-known providers (AWS Lambda, Google Cloud Functions and Microsoft Azure Functions) without examining other providers in similar detail. In this work, we conduct the first detailed review of ten currently publicly available FaaS platforms, exploring everything from their history to their features and pricing to where they sit within the overall public FaaS landscape, before making a number of observations on the state of FaaS.

Citations: 0

Reinforcement Learning based Workflow Scheduling in Cloud and Edge Computing Environments: A Taxonomy, Review and Future Directions
Pub Date: 2024-08-06 · arXiv:2408.02938
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya
Abstract: Deep Reinforcement Learning (DRL) techniques have been successfully applied to complex decision-making and control tasks in multiple fields, including robotics, autonomous driving, healthcare and natural language processing. The ability of DRL agents to learn from experience and utilize real-time data in making decisions makes them ideal candidates for dealing with the complexities of workflow scheduling in highly dynamic cloud and edge computing environments. Despite these benefits, applying DRL techniques raises multiple challenges, including multi-objectivity, the curse of dimensionality, partial observability and multi-agent coordination. In this paper, we comprehensively analyze the challenges and opportunities associated with the design and implementation of DRL-oriented solutions for workflow scheduling in cloud and edge computing environments. Based on the identified characteristics, we propose a taxonomy of workflow scheduling with DRL. We map the reviewed works against the taxonomy to identify their strengths and weaknesses. Based on this taxonomy-driven analysis, we propose novel future research directions for the field.

Citations: 0

A Deep Reinforcement Learning Approach for Cost Optimized Workflow Scheduling in Cloud Computing Environments
Pub Date: 2024-08-06 · arXiv:2408.02926
Amanda Jayanetti, Saman Halgamuge, Rajkumar Buyya
Abstract: Cost optimization is a common goal of workflow schedulers operating in cloud computing environments. The use of spot instances is a potential means of achieving this goal, as cloud providers offer them at discounted prices compared to their on-demand counterparts in exchange for reduced reliability: spot instances are interrupted when the spare computing capacity used for provisioning them is reclaimed owing to demand variations. Moreover, spot prices are not fixed, as pricing depends on long-term supply and demand. The possibility of interruptions and pricing variations adds a layer of uncertainty to the general problem of workflow scheduling across cloud computing environments. These challenges must be addressed efficiently to enjoy the cost savings achievable with spot instances without compromising the underlying business requirements. To this end, in this paper we use Deep Reinforcement Learning to develop an autonomous agent capable of scheduling workflows in a cost-efficient manner by using an intelligent mix of spot and on-demand instances. The proposed solution is implemented in the open-source, container-native Argo workflow engine, which is widely used for executing industrial workflows. The experimental results demonstrate that the proposed scheduling method outperforms the current benchmarks.

Citations: 0

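The spot-vs-on-demand trade-off the abstract describes can be illustrated with a bandit-style toy agent that learns the average realized cost of each instance type. The prices, the 20% interruption rate, and the retry penalty below are simulated assumptions of this sketch, not data from the paper, and a bandit is a drastic simplification of the deep RL agent the authors build.

```python
import random

def bandit_scheduler(episodes=500, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over two actions: 0 = spot, 1 = on-demand.
    q[a] tracks the running-average observed cost of action a."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    counts = [0, 0]

    def observe_cost(action):
        if action == 1:
            return 1.0  # on-demand: fixed unit price, always reliable
        # spot: heavily discounted, but an interruption forces a costly retry
        return 0.3 + (1.0 if rng.random() < 0.2 else 0.0)

    for _ in range(episodes):
        if rng.random() < epsilon or 0 in counts:
            a = rng.randrange(2)            # explore (or try untried actions)
        else:
            a = 0 if q[0] <= q[1] else 1    # exploit the cheaper action so far
        cost = observe_cost(a)
        counts[a] += 1
        q[a] += (cost - q[a]) / counts[a]   # incremental mean update
    return q
```

Under these assumed numbers the expected spot cost (0.3 + 0.2 × 1.0 = 0.5) is below the on-demand price, so the learned averages tend to steer the agent toward spot instances; a full DRL scheduler additionally conditions such decisions on workflow state and deadlines.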
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
Pub Date: 2024-08-05 · arXiv:2408.02218
Yao Xu, Gene Cooperman
Abstract: MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing is an important component in any strategy for software resilience and for long-running jobs that must be executed by chaining together time-bounded resource allocations. This work solves an old problem: a practical and general algorithm for transparent checkpointing of MPI that is both efficient and compatible with most of the latest network software. Transparent checkpointing is attractive due to its generality and ease of use for most MPI application developers. Earlier efforts at transparent checkpointing for MPI, one decade ago, suffered from two difficult problems: (i) they relied on a specific MPI implementation tied to a specific network technology; and (ii) they failed to demonstrate sufficiently low runtime overhead. Problem (i) (network dependence) was already solved in 2019 by MANA's introduction of split processes. Problem (ii) (efficient runtime overhead) is solved in this work. This paper introduces an approach that avoids these limitations, employing a novel topological sort to algorithmically determine a safe future synchronization point. The algorithm is valid for both blocking and non-blocking collective communication in MPI. We demonstrate the efficacy and scalability of our approach through both micro-benchmarks and a set of five real-world MPI applications, notably including the widely used VASP (Vienna Ab Initio Simulation Package), which is responsible for 11% of the workload on the Perlmutter supercomputer at Lawrence Berkeley National Laboratory. VASP was previously cited as a special challenge for checkpointing, in part due to its multi-algorithm codes.

Citations: 0

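For readers unfamiliar with the building block, a topological sort orders the nodes of a dependency DAG so every edge points forward. The sketch below is textbook Kahn's algorithm; the framing of nodes as pending MPI operations is an illustrative assumption, and the paper's actual algorithm for choosing a safe synchronization point is more involved.

```python
from collections import deque

def topological_sort(num_nodes, edges):
    """Kahn's algorithm. Nodes could model pending collective operations and
    edges their dependencies; a valid topological order then suggests a
    sequence in which operations can be drained before a checkpoint."""
    indeg = [0] * num_nodes
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:          # edge u -> v: u must complete before v
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in range(num_nodes) if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:        # completing u releases its dependents
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != num_nodes: # a cycle means no safe ordering exists
        raise ValueError("dependency cycle detected")
    return order
```

Running it on the diamond DAG `0 -> {1, 2} -> 3` yields an order that starts at 0 and ends at 3, as any valid drain order must.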
Asynchronous Latency and Fast Atomic Snapshot
Pub Date: 2024-08-05 · arXiv:2408.02562
João Paulo Bezerra, Luciano Freitas, Petr Kuznetsov
Abstract: The original goal of this paper was a novel, fast atomic-snapshot protocol for asynchronous message-passing systems. In the process of defining what "fast" means exactly, we faced a number of interesting issues that arise when conventional time metrics are applied to asynchronous implementations. We discovered gaps in the latency claims made in earlier work on snapshot algorithms, which hamper their comparative time-complexity analysis. We then devised a new unifying time-complexity analysis that captures the latency of an operation in an asynchronous, long-lived implementation, which allowed us to formally grasp the latency improvements of our solution with respect to state-of-the-art protocols: optimal latency in fault-free runs without contention, short constant latency in fault-free runs with contention, worst-case latency proportional to the number of failures, and constant, close-to-optimal amortized latency.

Citations: 0