Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
arXiv:2408.16978 · https://doi.org/arxiv-2408.16978 · arXiv - CS - Distributed, Parallel, and Cluster Computing · published 2024-08-30

Abstract: Large Language Models (LLMs) with long-context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long-context capabilities via downstream fine-tuning or adaptations impose significant design limitations. In this paper, we propose the Fully Pipelined Distributed Transformer (FPDT) for training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in the sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence-chunk pipeline design, we can now train an 8B LLM with a sequence length of 2 million tokens on only 4 GPUs, while maintaining over 55% MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

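The memory argument behind sequence chunking can be illustrated with a toy, single-head chunked causal attention in NumPy: queries are processed one chunk at a time against the keys seen so far, so the score matrix held at any moment is chunk_len x seq_len rather than seq_len x seq_len. This is only a sketch of the general chunking idea, not FPDT's distributed pipeline; all names here are illustrative.

```python
import numpy as np

def chunked_causal_attention(q, k, v, chunk_len):
    """Toy single-head causal attention, processed in query chunks so
    that peak memory scales with chunk_len * seq_len, not seq_len**2."""
    seq_len, d = q.shape
    out = np.empty_like(q)
    for start in range(0, seq_len, chunk_len):
        end = min(start + chunk_len, seq_len)
        # each query chunk only attends to keys up to its own position
        scores = q[start:end] @ k[:end].T / np.sqrt(d)
        # causal mask: query at position i may not see keys at j > i
        mask = np.arange(end)[None, :] > np.arange(start, end)[:, None]
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[start:end] = weights @ v[:end]
    return out
```

Because causal attention never looks ahead, the chunked result is mathematically identical to processing the whole sequence at once; only the peak memory differs.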
Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine
Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna
arXiv:2409.00287 · https://doi.org/arxiv-2409.00287 · published 2024-08-30

Abstract: Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in Natural Language Processing (NLP) and Computer Vision (CV). LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range global attention relationships among input words or image patches, drastically improving their performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE), a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. The WSE's Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures routing between cores at runtime, optimizing communication overhead. As LLMs become more widely used, new hardware architectures are needed to accelerate their training and inference. We benchmark how effectively this hardware architecture accelerates LLM training and inference. Additionally, we analyze whether the Cerebras WSE can scale the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of the Cerebras WSE through a roofline model: by plotting performance metrics against computational intensity, we assess its effectiveness at handling compute-intensive LLM training and inference tasks.

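The roofline methodology mentioned above reduces to one formula: attainable throughput is the lesser of the hardware's compute ceiling and arithmetic intensity times memory bandwidth. A minimal sketch follows; the peak-FLOPS figure used in the test is purely hypothetical, not a Cerebras specification.

```python
def roofline_attainable(intensity_flop_per_byte, peak_flops, mem_bw_bytes_per_s):
    """Roofline model: a kernel is memory-bound on the bandwidth slope
    until its arithmetic intensity reaches the ridge point, after which
    it is capped by the compute ceiling."""
    return min(peak_flops, intensity_flop_per_byte * mem_bw_bytes_per_s)

def ridge_point(peak_flops, mem_bw_bytes_per_s):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / mem_bw_bytes_per_s
```

Plotting `roofline_attainable` against intensity on log-log axes gives the familiar slanted-roof shape the paper uses to classify LLM kernels as memory- or compute-bound.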
Monadring: A lightweight consensus protocol to offer Validation-as-a-Service to AVS nodes
Yu Zhang, Xiao Yan, Gang Tang, Helena Wang
arXiv:2408.16094 · https://doi.org/arxiv-2408.16094 · published 2024-08-28

Abstract: Existing blockchain networks are often large-scale, requiring transactions to be synchronized across the entire network to reach consensus. On-chain computations can be prohibitively expensive, making many CPU-intensive computations infeasible. Inspired by the structure of IBM's token ring networks, we propose a lightweight consensus protocol called Monadring to address these issues. Monadring allows nodes within a large blockchain network to form smaller subnetworks, enabling faster and more cost-effective computations while maintaining the security guarantees of the main blockchain network. To further enhance Monadring's security, we introduce a node rotation mechanism based on Verifiable Random Functions (VRF) and blind voting using Fully Homomorphic Encryption (FHE) within the smaller subnetwork. Unlike the common voting-based election of validator nodes, Monadring leverages FHE to conceal voting information, eliminating the last-mover advantage in the voting process. This paper details the design and implementation of the Monadring protocol and evaluates its performance and feasibility through simulation experiments. Our research contributes to enhancing the practical utility of blockchain technology in large-scale application scenarios.

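VRF-based rotation can be pictured with a toy stand-in: each node derives a deterministic pseudorandom value for the round, and the lowest-ranked nodes form the round's subnetwork. Note the loud caveat in the code: HMAC is not a VRF (a real VRF additionally produces a publicly verifiable proof), and none of these names come from the Monadring paper.

```python
import hashlib
import hmac

def toy_vrf_output(secret_key: bytes, round_seed: bytes) -> int:
    """NOT a real VRF: a real VRF also yields a proof that third parties
    can verify without the secret key. This toy only provides the
    deterministic per-node pseudorandomness needed for the sketch."""
    digest = hmac.new(secret_key, round_seed, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")

def rotate_subnetwork(node_keys, round_seed, subnet_size):
    """Select subnet_size nodes for this round by ranking each node's
    pseudorandom output; the seed changes every round, so membership
    rotates unpredictably but reproducibly."""
    ranked = sorted(node_keys,
                    key=lambda nid: toy_vrf_output(node_keys[nid], round_seed))
    return ranked[:subnet_size]
```

Because the selection depends only on the round seed and each node's key, every honest node computes the same subnetwork without further communication.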
LLMSecCode: Evaluating Large Language Models for Secure Coding
Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren
arXiv:2408.16100 · https://doi.org/arxiv-2408.16100 · published 2024-08-28

Abstract: The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection process for LLMs suitable for facilitating Secure Coding (SC). This raises challenging research questions: (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How can we attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments: when varying parameters and prompts, we find a 10% and 9% difference in performance, respectively. We also compare some results to those of reliable external actors, where our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.

Decentralized LLM Inference over Edge Networks with Energy Harvesting
Aria Khoshsirat, Giovanni Perin, Michele Rossi
arXiv:2408.15907 · https://doi.org/arxiv-2408.15907 · published 2024-08-28

Abstract: Large language models have significantly transformed multiple fields with their exceptional performance in natural language tasks, but their deployment in resource-constrained environments such as edge networks remains an ongoing challenge. Decentralized inference techniques have emerged that distribute the model blocks among multiple devices to improve flexibility and cost-effectiveness. However, energy limitations remain a significant concern for edge devices. We propose a sustainable model for collaborative inference on interconnected, battery-powered edge devices with energy harvesting. A semi-Markov model is developed to describe the states of the devices, considering processing parameters and average green-energy arrivals. This informs the design of scheduling algorithms that aim to minimize device downtimes and maximize network throughput. Through empirical evaluations and simulated runs, we validate the effectiveness of our approach, paving the way for energy-efficient decentralized inference over edge networks.

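The trade-off the schedulers navigate (spend stored energy now vs. wait for harvested energy) can be shown with a deliberately simple greedy simulation. This is not the paper's semi-Markov-derived policy; the rule "give the task to the device with the most stored energy" is a hypothetical baseline for illustration.

```python
def simulate_schedule(batteries, n_tasks, cost, harvest):
    """Toy energy-harvesting scheduler. batteries maps device -> stored
    energy. Each time step, the next inference task runs on the device
    with the most energy if it can pay `cost`; otherwise the step is
    counted as downtime. All devices then harvest green energy."""
    completed, downtime = 0, 0
    while completed < n_tasks:
        best = max(batteries, key=batteries.get)
        if batteries[best] >= cost:
            batteries[best] -= cost
            completed += 1
        else:
            downtime += 1  # nobody can afford the task this step
        for device in batteries:
            batteries[device] += harvest[device]
    return completed, downtime
```

Even this toy shows why scheduling matters: downtime appears exactly when demand outpaces the harvest rate, which is what the paper's algorithms try to minimize.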
Towards cloud-native scientific workflow management
Michal Orzechowski, Bartosz Balis, Krzysztof Janecki
arXiv:2408.15445 · https://doi.org/arxiv-2408.15445 · published 2024-08-27

Abstract: Cloud-native is an approach to building and running scalable applications in modern cloud infrastructures, with the Kubernetes container orchestration platform often considered a fundamental cloud-native building block. In this paper, we evaluate alternative execution models for scientific workflows in Kubernetes. We compare the simplest job-based model, its variant with task clustering, and a proposed cloud-native model based on microservices comprising auto-scalable worker pools. We implement these models in the HyperFlow workflow management system and evaluate them using a large Montage workflow on a Kubernetes cluster. The results indicate that the proposed cloud-native worker-pool execution model achieves the best performance in terms of average cluster utilization, resulting in a nearly 20% improvement in workflow makespan compared to the best-performing job-based model. However, the better performance comes at the cost of significantly higher implementation and maintenance complexity. We believe that our experiments provide valuable insight into the performance, advantages, and disadvantages of alternative cloud-native execution models for scientific workflows.

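The key difference between the job-based and worker-pool models is who owns the task loop: instead of launching one job per task (paying startup cost each time), long-lived workers pull tasks from a shared queue. A minimal in-process sketch with threads, which stands in for what the paper realizes with Kubernetes pods:

```python
import queue
import threading

def run_with_worker_pool(tasks, n_workers, handler):
    """Toy worker-pool executor: n_workers long-lived workers drain a
    shared task queue, amortizing startup cost across all tasks."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # queue drained: the worker retires
            r = handler(task)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

In the cloud-native version the queue is an external service and the pool auto-scales, but the utilization argument is the same: idle workers pick up the next task immediately rather than waiting for a new job to be scheduled.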
Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning
Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu
arXiv:2408.14736 · https://doi.org/arxiv-2408.14736 · published 2024-08-27

Abstract: Current data compression methods, such as sparsification in Federated Averaging (FedAvg), effectively enhance the communication efficiency of Federated Learning (FL). However, these methods encounter challenges such as the straggler problem and diminished model performance due to heterogeneous bandwidth and non-IID (not Independently and Identically Distributed) data. To address these issues, we introduce a bandwidth-aware compression framework for FL, aimed at improving communication efficiency while mitigating the problems associated with non-IID data. First, our strategy dynamically adjusts compression ratios according to bandwidth, enabling clients to upload their models at a similar pace and thus exploiting otherwise wasted time to transmit more data. Second, we identify the non-overlapping pattern of the parameters retained after compression, which diminishes client update signals when weights are uniformly averaged. Based on this finding, we propose a parameter mask that adjusts the client-averaging coefficients at the parameter level, thereby more closely approximating the original updates and improving training convergence in heterogeneous environments. Our evaluations reveal that our method significantly boosts model accuracy, with a maximum improvement of 13% over uncompressed FedAvg. Moreover, it achieves a 3.37x speedup in reaching the target accuracy compared to FedAvg with a Top-K compressor, demonstrating its effectiveness in accelerating convergence under compression. The integration of common compression techniques into our framework further establishes its potential as a versatile foundation for future cross-device, communication-efficient FL research, addressing critical challenges in FL and advancing the field of distributed machine learning.

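Both ideas above can be sketched in a few lines of NumPy: Top-K sparsification with a bandwidth-scaled keep ratio, and per-parameter averaging that divides each coordinate by the number of clients that actually retained it rather than by the total client count. The linear bandwidth rule is a hypothetical placeholder, not the paper's exact policy.

```python
import numpy as np

def bandwidth_aware_ratio(bandwidth, max_bandwidth, base_ratio):
    """Hypothetical linear rule: faster clients keep (and send) a larger
    fraction of their update so all uploads finish at a similar pace."""
    return base_ratio * bandwidth / max_bandwidth

def top_k_sparsify(update, ratio):
    """Keep only the largest-magnitude fraction `ratio` of entries,
    returning the sparse update and the mask of retained positions."""
    k = max(1, int(update.size * ratio))
    idx = np.argsort(np.abs(update))[-k:]
    mask = np.zeros(update.shape, dtype=bool)
    mask[idx] = True
    return update * mask, mask

def masked_average(updates, masks):
    """Average each parameter only over the clients that retained it.
    Uniform averaging would divide by the client count and shrink the
    non-overlapping coordinates toward zero."""
    total = np.sum(updates, axis=0)
    counts = np.maximum(np.sum(masks, axis=0), 1)  # avoid divide-by-zero
    return total / counts
```

The `masked_average` denominator is the "parameter mask" intuition: a coordinate kept by only one client is restored at full strength instead of being diluted by clients that dropped it.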
Faster Cycle Detection in the Congested Clique
Keren Censor-Hillel, Tomer Even, Virginia Vassilevska Williams
arXiv:2408.15132 · https://doi.org/arxiv-2408.15132 · published 2024-08-27

Abstract: We provide a fast distributed algorithm for detecting $h$-cycles in the Congested Clique model, whose running time decreases as the number of $h$-cycles in the graph increases. In undirected graphs, constant-round algorithms are known for cycles of even length. Our algorithm greatly improves upon the state of the art for odd values of $h$. Moreover, our running time also applies to directed graphs, in which case the improvement holds for all values of $h$. Further, our techniques allow us to obtain a triangle detection algorithm in the quantum variant of this model that is faster than prior work. A key technical contribution we develop to obtain our fast cycle detection algorithm is a new algorithm for computing the products of many pairs of small matrices in parallel, which may be of independent interest.

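The connection between cycle detection and matrix multiplication is classical: in an undirected graph with adjacency matrix $A$, the trace of $A^3$ counts each triangle six times (three starting vertices, two directions). A centralized NumPy sketch of that fact follows; the paper's contribution is a fast distributed and batched version of such products, not this sequential computation.

```python
import numpy as np

def count_triangles(adj):
    """Count triangles in an undirected graph via matrix powers:
    trace(A^3) = number of closed walks of length 3, which counts
    every triangle exactly 6 times."""
    a = np.asarray(adj, dtype=np.int64)
    return int(np.trace(a @ a @ a) // 6)
```

The same diagonal-of-a-power idea generalizes to longer cycles, which is why fast (distributed) matrix multiplication is the workhorse of $h$-cycle detection.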
Towards observability of scientific applications
Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski
arXiv:2408.15439 · https://doi.org/arxiv-2408.15439 · published 2024-08-27

Abstract: As software systems increase in complexity, conventional monitoring methods struggle to provide a comprehensive overview or identify performance issues, often missing unexpected problems. Observability, however, offers a holistic approach, providing methods and tools that gather and analyze detailed telemetry data to uncover hidden issues. Originally developed for cloud-native systems, modern observability is less prevalent in scientific computing, particularly in HPC clusters, due to differences in application architecture, execution environments, and technology stacks. This paper proposes and evaluates an end-to-end observability solution tailored to scientific computing in HPC environments. We address several challenges, including the collection of application-level metrics, instrumentation, context propagation, and tracing. We argue that typical dashboards with charts are not sufficient for advanced observability-driven analysis of scientific applications. Consequently, we propose a different approach based on data analysis using DataFrames in a Jupyter environment. The proposed solution is implemented and evaluated on two medical scientific pipelines running on an HPC cluster.

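The DataFrame-based analysis style argued for above looks roughly like this: raw trace spans land in a pandas DataFrame, and ad hoc questions ("which pipeline stage dominates latency?") become short groupby expressions rather than pre-built dashboard charts. The column names here are illustrative, not the paper's schema.

```python
import pandas as pd

def slowest_stages(spans: pd.DataFrame, top: int = 3) -> pd.DataFrame:
    """Aggregate raw trace spans (columns: stage, start, end) into
    per-stage latency statistics, sorted by mean duration."""
    stats = (spans.assign(duration=spans["end"] - spans["start"])
                  .groupby("stage")["duration"]
                  .agg(["mean", "max", "count"]))
    return stats.sort_values("mean", ascending=False).head(top)
```

In a Jupyter environment the same DataFrame can immediately be re-sliced by node, by input file, or by run, which is the flexibility a fixed dashboard lacks.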
Partition Detection in Byzantine Networks
Yérom-David Bromberg (IRISA, UR), Jérémie Decouchant (TU Delft), Manon Sourisseau (IRISA, UR), François Taïani (IRISA, UR)
arXiv:2408.14814 · https://doi.org/arxiv-2408.14814 · published 2024-08-27

Abstract: Detecting and handling network partitions is a fundamental requirement of distributed systems. Although existing partition-detection methods in arbitrary graphs tolerate unreliable networks, they either assume that all nodes are correct or that only a limited number of nodes might crash. In particular, Byzantine behaviors are out of the scope of these algorithms, despite Byzantine fault tolerance being an active research topic for important problems such as consensus. Moreover, Byzantine-tolerant protocols, such as broadcast or consensus, always rely on the assumption of connected networks. This paper addresses the problem of detecting partitions in Byzantine networks (without connectivity assumptions). We present a novel algorithm, which we call NECTAR, that safely detects partitioned and possibly partitionable networks, and we prove its correctness. NECTAR allows all correct nodes to detect whether a network could suffer from Byzantine nodes. We evaluate NECTAR's performance and compare it to two existing baselines using up to 100 nodes running real code on various realistic topologies. Our results confirm that NECTAR maintains 100% accuracy, while the accuracy of the existing baselines decreases by at least 40% as soon as one participant is Byzantine. Although NECTAR's network cost increases with the number of nodes and decreases with the network's diameter, it does not exceed roughly 500 KB in the worst cases.

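The property NECTAR detects, "possibly partitionable", can be stated graph-theoretically: a network is at risk if removing some set of at most f (Byzantine) nodes disconnects the remaining correct nodes. The brute-force check below illustrates the property only; it is not NECTAR's distributed algorithm and is viable only for small graphs.

```python
from itertools import combinations

def reachable(adj, start, removed):
    """Nodes reachable from `start` when `removed` nodes are cut out."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def partitionable(adj, f):
    """True if removing some set of at most f nodes disconnects the
    remaining graph. Exponential brute force: illustration only."""
    nodes = list(adj)
    for r in range(f + 1):
        for cut in combinations(nodes, r):
            rest = [n for n in nodes if n not in cut]
            if rest and reachable(adj, rest[0], set(cut)) != set(rest):
                return True
    return False
```

A ring, for instance, survives any single removal but splits when two opposite nodes are cut, so it is safe for f = 1 and partitionable for f = 2.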