How to Rent GPUs on a Budget
Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
arXiv:2406.15560 (arXiv - CS - Performance), 2024-06-21

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
{"title":"Optimizing Speculative Decoding for Serving Large Language Models Using Goodput","authors":"Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang","doi":"arxiv-2406.14066","DOIUrl":"https://doi.org/arxiv-2406.14066","url":null,"abstract":"Reducing the inference latency of large language models (LLMs) is crucial,\u0000and speculative decoding (SD) stands out as one of the most effective\u0000techniques. Rather than letting the LLM generate all tokens directly,\u0000speculative decoding employs effective proxies to predict potential outputs,\u0000which are then verified by the LLM without compromising the generation quality.\u0000Yet, deploying SD in real online LLM serving systems (with continuous batching)\u0000does not always yield improvement -- under higher request rates or low\u0000speculation accuracy, it paradoxically increases latency. Furthermore, there is\u0000no best speculation length work for all workloads under different system loads.\u0000Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec\u0000dynamically determines the best speculation length for each request (from 0,\u0000i.e., no speculation, to many tokens) -- hence the associated speculative\u0000execution costs -- based on a new metric called goodput, which characterizes\u0000the current observed load of the entire system and the speculation accuracy. We\u0000show that SmartSpec consistently reduces average request latency by up to 3.2x\u0000compared to non-speculative decoding baselines across different sizes of target\u0000models, draft models, request rates, and datasets. Moreover, SmartSpec can be\u0000applied to different styles of speculative decoding, including traditional,\u0000model-based approaches as well as model-free methods like prompt lookup and\u0000tree-style decoding.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines","authors":"Wenbo Sun, Jiaqi Wang, Qiming Guo, Ziyu Li, Wenlu Wang, Rihan Hai","doi":"arxiv-2407.12797","DOIUrl":"https://doi.org/arxiv-2407.12797","url":null,"abstract":"Online Large Language Model (LLM) services such as ChatGPT and Claude 3 have\u0000transformed business operations and academic research by effortlessly enabling\u0000new opportunities. However, due to data-sharing restrictions, sectors such as\u0000healthcare and finance prefer to deploy local LLM applications using costly\u0000hardware resources. This scenario requires a balance between the effectiveness\u0000advantages of LLMs and significant financial burdens. Additionally, the rapid\u0000evolution of models increases the frequency and redundancy of benchmarking\u0000efforts. Existing benchmarking toolkits, which typically focus on\u0000effectiveness, often overlook economic considerations, making their findings\u0000less applicable to practical scenarios. To address these challenges, we\u0000introduce CEBench, an open-source toolkit specifically designed for\u0000multi-objective benchmarking that focuses on the critical trade-offs between\u0000expenditure and effectiveness required for LLM deployments. CEBench allows for\u0000easy modifications through configuration files, enabling stakeholders to\u0000effectively assess and optimize these trade-offs. This strategic capability\u0000supports crucial decision-making processes aimed at maximizing effectiveness\u0000while minimizing cost impacts. By streamlining the evaluation process and\u0000emphasizing cost-effectiveness, CEBench seeks to facilitate the development of\u0000economically viable AI solutions across various industries and research fields.\u0000The code and demonstration are available in\u0000url{https://github.com/amademicnoboday12/CEBench}.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141746437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

FastPersist: Accelerating Model Checkpointing in Deep Learning
Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He
arXiv:2406.13768 (arXiv - CS - Performance), 2024-06-19

Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and downstream applications, such as inference. However, writing checkpoints to persistent storage, and other I/O aspects of DL training, are mostly ignored by compute-focused optimization efforts for faster training of rapidly growing models and datasets. Towards addressing this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real-world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.

A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models
L. Apanasevich, Yogesh Kale, Himanshu Sharma, Ana Marija Sokovic
arXiv:2406.10362 (arXiv - CS - Performance), 2024-06-14

For many years, systems running Nvidia-based GPU architectures have dominated the heterogeneous supercomputer landscape. However, recently GPU chipsets manufactured by Intel and AMD have cut into this market and can now be found in some of the world's fastest supercomputers. The June 2023 edition of the TOP500 list of supercomputers ranks the Frontier supercomputer at the Oak Ridge National Laboratory in Tennessee as the top system in the world. This system features AMD Instinct MI250X GPUs and is currently the only true exascale computer in the world. The first framework that enabled support for heterogeneous platforms across multiple hardware vendors was OpenCL, in 2009. Since then, a number of frameworks have been developed to support vendor-agnostic heterogeneous environments, including OpenMP, OpenCL, Kokkos, and SYCL. SYCL, which combines the concepts of OpenCL with the flexibility of single-source C++, is one of the more promising programming models for heterogeneous computing devices. One key advantage of this framework is that it provides a higher-level programming interface that abstracts away more of the hardware details than the other frameworks. This makes SYCL easier to learn and to maintain across multiple architectures and vendors. In recent years, there has been growing interest in using heterogeneous computing architectures to accelerate molecular dynamics simulations. Some of the more popular molecular dynamics packages include Amber, NAMD, and GROMACS. However, to the best of our knowledge, only GROMACS has been successfully ported to SYCL to date. In this paper, we compare the performance of GROMACS compiled using the SYCL and CUDA frameworks for a variety of standard GROMACS benchmarks. In addition, we compare its performance across three different Nvidia GPU chipsets: P100, V100, and A100.
{"title":"Modeling Common Cause Failure in Dynamic PRA","authors":"Claudia PicocoEDF R&D, Valentin RychkovEDF R&D","doi":"arxiv-2406.08879","DOIUrl":"https://doi.org/arxiv-2406.08879","url":null,"abstract":"In this paper we propose a dynamic model of Common Cause Failures (CCF) that\u0000allows to generate common cause events in time. The proposed model is a\u0000generalization of Binomial Failure Rate Model (Atwood model) that can generate\u0000staggered failures of multiple components due to a common cause. We implement\u0000the model using statechart formalism, a similar implementation can be adopted\u0000in other modeling languages like Petri Nets or Hybrid Stochastic Automata. The\u0000presented model was integrated in a Dynamic PRA study.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"98 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives
Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Jonathan Gietz, Konstantin Lübeck, Oliver Bringmann
arXiv:2406.08330 (arXiv - CS - Performance), 2024-06-12

Statistical models are widely used to estimate the performance of commercial off-the-shelf (COTS) AI hardware accelerators. However, training statistical performance models often requires vast amounts of data, leading to a significant time investment, and can be difficult in case of limited hardware availability. To alleviate this problem, we propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy. Our approach leverages knowledge of the target hardware architecture and initial parameter sweeps to identify a set of Performance Representatives (PR) for deep neural network (DNN) layers. These PRs are then used for benchmarking, building a statistical performance model, and making estimations. This targeted approach drastically reduces the number of training samples needed, as opposed to random sampling, to achieve a better estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-DNN estimations with fewer than 10,000 training samples. The results demonstrate the superiority of our method for single-layer estimations compared to models trained with randomly sampled datasets of the same size.

ONNXim: A Fast, Cycle-level Multi-core NPU Simulator
Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim
arXiv:2406.08051 (arXiv - CS - Performance), 2024-06-12

As DNNs are widely adopted in various application domains while demanding increasingly higher compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) is becoming more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. It takes DNN models represented in the ONNX graph format generated from various deep learning frameworks for ease of simulation. In addition, based on the observation that typical NPU cores process tensor tiles from on-chip scratchpad memory with deterministic compute latency, we forgo detailed modeling of the computation while still preserving simulation accuracy. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly capture contention among multiple cores that can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to 384x over Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow speed and/or lack of functionality. ONNXim is publicly available at https://github.com/PSAL-POSTECH/ONNXim.

ProTrain: Efficient LLM Training via Memory-Aware Techniques
Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
arXiv:2406.08334 (arXiv - CS - Performance), 2024-06-12

Training Large Language Models (LLMs) is extremely memory-hungry. To solve this problem, existing work exploits the combination of CPU and GPU for the training process, as in ZeRO-Offload. Such techniques largely democratize billion-scale model training, making it possible to train with a few consumer graphics cards. However, based on our observation, existing frameworks often provide coarse-grained memory management and require experienced experts for configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and I/O. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43x to 2.71x compared to state-of-the-art training systems.
{"title":"Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph Analysis","authors":"Jesmin Jahan Tithi, Fabio Checconi, Fabrizio Petrini","doi":"arxiv-2406.07727","DOIUrl":"https://doi.org/arxiv-2406.07727","url":null,"abstract":"Multi-hop reasoning (MHR) is a process in artificial intelligence and natural\u0000language processing where a system needs to make multiple inferential steps to\u0000arrive at a conclusion or answer. In the context of knowledge graphs or\u0000databases, it involves traversing multiple linked entities and relationships to\u0000understand complex queries or perform tasks requiring a deeper understanding.\u0000Multi-hop reasoning is a critical function in various applications, including\u0000question answering, knowledge base completion, and link prediction. It has\u0000garnered significant interest in artificial intelligence, machine learning, and\u0000graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale\u0000graphs, diverging from the traditional emphasis on accuracy which is an\u0000orthogonal goal. We introduce a novel parallel algorithm that harnesses\u0000domain-specific learned embeddings to efficiently identify the top K paths\u0000between vertices in a knowledge graph to find the best answers to a three-hop\u0000query. Our contributions are: (1) We present a new parallel algorithm to\u0000enhance MHR performance, scalability and efficiency. (2) We demonstrate the\u0000algorithm's superior performance on leading-edge Intel and AMD architectures\u0000through empirical results. We showcase the algorithm's practicality through a case study on identifying\u0000academic affiliations of potential Turing Award laureates in Deep Learning,\u0000highlighting its capability to handle intricate entity relationships. This\u0000demonstrates the potential of our approach to enabling high-performance MHR,\u0000useful to navigate the growing complexity of modern knowledge graphs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"193 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}