Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
arXiv:2408.16978 · https://doi.org/arxiv-2408.16978 · arXiv - CS - Distributed, Parallel, and Cluster Computing · published 2024-08-30

Abstract: Large Language Models (LLMs) with long-context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long-context capabilities via downstream fine-tuning or adaptations impose significant design limitations. In this paper, we propose the Fully Pipelined Distributed Transformer (FPDT) for training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in the sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence-chunk pipeline design, we can now train an 8B LLM with a sequence length of 2 million tokens on only 4 GPUs, while maintaining over 55% MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.

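The memory argument behind sequence chunking can be illustrated with a toy, single-head chunked causal attention in NumPy: queries are processed one chunk at a time against the keys seen so far, so the score matrix held at any moment is chunk_len x seq_len rather than seq_len x seq_len. This is only a sketch of the general chunking idea, not FPDT's distributed pipeline; all names here are illustrative.

```python
import numpy as np

def chunked_causal_attention(q, k, v, chunk_len):
    """Toy single-head causal attention, processed in query chunks so
    that peak memory scales with chunk_len * seq_len, not seq_len**2."""
    seq_len, d = q.shape
    out = np.empty_like(q)
    for start in range(0, seq_len, chunk_len):
        end = min(start + chunk_len, seq_len)
        # each query chunk only attends to keys up to its own position
        scores = q[start:end] @ k[:end].T / np.sqrt(d)
        # causal mask: query at position i may not see keys at j > i
        mask = np.arange(end)[None, :] > np.arange(start, end)[:, None]
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[start:end] = weights @ v[:end]
    return out
```

Because causal attention never looks ahead, the chunked result is mathematically identical to processing the whole sequence at once; only the peak memory differs.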
Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine
Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna
arXiv:2409.00287 · https://doi.org/arxiv-2409.00287 · published 2024-08-30

Abstract: Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in Natural Language Processing (NLP) and Computer Vision (CV). LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range global attention relationships among input words or image patches, drastically improving their performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE), a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. The WSE's Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures routing between cores at runtime, optimizing communication overhead. As LLMs become more widely used, new hardware architectures are needed to accelerate their training and inference. We benchmark how effectively this hardware architecture accelerates LLM training and inference. Additionally, we analyze whether the Cerebras WSE can scale the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of the Cerebras WSE through a roofline model: by plotting performance metrics against computational intensity, we assess its effectiveness at handling compute-intensive LLM training and inference tasks.

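The roofline methodology mentioned above reduces to one formula: attainable throughput is the lesser of the hardware's compute ceiling and arithmetic intensity times memory bandwidth. A minimal sketch follows; the peak-FLOPS figure used in the test is purely hypothetical, not a Cerebras specification.

```python
def roofline_attainable(intensity_flop_per_byte, peak_flops, mem_bw_bytes_per_s):
    """Roofline model: a kernel is memory-bound on the bandwidth slope
    until its arithmetic intensity reaches the ridge point, after which
    it is capped by the compute ceiling."""
    return min(peak_flops, intensity_flop_per_byte * mem_bw_bytes_per_s)

def ridge_point(peak_flops, mem_bw_bytes_per_s):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / mem_bw_bytes_per_s
```

Plotting `roofline_attainable` against intensity on log-log axes gives the familiar slanted-roof shape the paper uses to classify LLM kernels as memory- or compute-bound.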
Monadring: A lightweight consensus protocol to offer Validation-as-a-Service to AVS nodes
Yu Zhang, Xiao Yan, Gang Tang, Helena Wang
arXiv:2408.16094 · https://doi.org/arxiv-2408.16094 · published 2024-08-28

Abstract: Existing blockchain networks are often large-scale, requiring transactions to be synchronized across the entire network to reach consensus. On-chain computations can be prohibitively expensive, making many CPU-intensive computations infeasible. Inspired by the structure of IBM's token ring networks, we propose a lightweight consensus protocol called Monadring to address these issues. Monadring allows nodes within a large blockchain network to form smaller subnetworks, enabling faster and more cost-effective computations while maintaining the security guarantees of the main blockchain network. To further enhance Monadring's security, we introduce a node rotation mechanism based on Verifiable Random Functions (VRF) and blind voting using Fully Homomorphic Encryption (FHE) within the smaller subnetwork. Unlike the common voting-based election of validator nodes, Monadring leverages FHE to conceal voting information, eliminating the last-mover advantage in the voting process. This paper details the design and implementation of the Monadring protocol and evaluates its performance and feasibility through simulation experiments. Our research contributes to enhancing the practical utility of blockchain technology in large-scale application scenarios.

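VRF-based rotation can be pictured with a toy stand-in: each node derives a deterministic pseudorandom value for the round, and the lowest-ranked nodes form the round's subnetwork. Note the loud caveat in the code: HMAC is not a VRF (a real VRF additionally produces a publicly verifiable proof), and none of these names come from the Monadring paper.

```python
import hashlib
import hmac

def toy_vrf_output(secret_key: bytes, round_seed: bytes) -> int:
    """NOT a real VRF: a real VRF also yields a proof that third parties
    can verify without the secret key. This toy only provides the
    deterministic per-node pseudorandomness needed for the sketch."""
    digest = hmac.new(secret_key, round_seed, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")

def rotate_subnetwork(node_keys, round_seed, subnet_size):
    """Select subnet_size nodes for this round by ranking each node's
    pseudorandom output; the seed changes every round, so membership
    rotates unpredictably but reproducibly."""
    ranked = sorted(node_keys,
                    key=lambda nid: toy_vrf_output(node_keys[nid], round_seed))
    return ranked[:subnet_size]
```

Because the selection depends only on the round seed and each node's key, every honest node computes the same subnetwork without further communication.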
LLMSecCode: Evaluating Large Language Models for Secure Coding
Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren
arXiv:2408.16100 · https://doi.org/arxiv-2408.16100 · published 2024-08-28

Abstract: The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection process for LLMs suitable for facilitating Secure Coding (SC). This raises challenging research questions: (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How can we attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments: when varying parameters and prompts, we find a 10% and 9% difference in performance, respectively. We also compare some results to those of reliable external actors, where our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.

Decentralized LLM Inference over Edge Networks with Energy Harvesting
Aria Khoshsirat, Giovanni Perin, Michele Rossi
arXiv:2408.15907 · https://doi.org/arxiv-2408.15907 · published 2024-08-28

Abstract: Large language models have significantly transformed multiple fields with their exceptional performance in natural language tasks, but their deployment in resource-constrained environments such as edge networks remains an ongoing challenge. Decentralized inference techniques have emerged that distribute the model blocks among multiple devices to improve flexibility and cost-effectiveness. However, energy limitations remain a significant concern for edge devices. We propose a sustainable model for collaborative inference on interconnected, battery-powered edge devices with energy harvesting. A semi-Markov model is developed to describe the states of the devices, considering processing parameters and average green-energy arrivals. This informs the design of scheduling algorithms that aim to minimize device downtimes and maximize network throughput. Through empirical evaluations and simulated runs, we validate the effectiveness of our approach, paving the way for energy-efficient decentralized inference over edge networks.

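The trade-off the schedulers navigate (spend stored energy now vs. wait for harvested energy) can be shown with a deliberately simple greedy simulation. This is not the paper's semi-Markov-derived policy; the rule "give the task to the device with the most stored energy" is a hypothetical baseline for illustration.

```python
def simulate_schedule(batteries, n_tasks, cost, harvest):
    """Toy energy-harvesting scheduler. batteries maps device -> stored
    energy. Each time step, the next inference task runs on the device
    with the most energy if it can pay `cost`; otherwise the step is
    counted as downtime. All devices then harvest green energy."""
    completed, downtime = 0, 0
    while completed < n_tasks:
        best = max(batteries, key=batteries.get)
        if batteries[best] >= cost:
            batteries[best] -= cost
            completed += 1
        else:
            downtime += 1  # nobody can afford the task this step
        for device in batteries:
            batteries[device] += harvest[device]
    return completed, downtime
```

Even this toy shows why scheduling matters: downtime appears exactly when demand outpaces the harvest rate, which is what the paper's algorithms try to minimize.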
Towards cloud-native scientific workflow management
Michal Orzechowski, Bartosz Balis, Krzysztof Janecki
arXiv:2408.15445 · https://doi.org/arxiv-2408.15445 · published 2024-08-27

Abstract: Cloud-native is an approach to building and running scalable applications in modern cloud infrastructures, with the Kubernetes container orchestration platform often considered a fundamental cloud-native building block. In this paper, we evaluate alternative execution models for scientific workflows in Kubernetes. We compare the simplest job-based model, its variant with task clustering, and a proposed cloud-native model based on microservices comprising auto-scalable worker pools. We implement these models in the HyperFlow workflow management system and evaluate them using a large Montage workflow on a Kubernetes cluster. The results indicate that the proposed cloud-native worker-pool execution model achieves the best performance in terms of average cluster utilization, resulting in a nearly 20% improvement in workflow makespan compared to the best-performing job-based model. However, the better performance comes at the cost of significantly higher implementation and maintenance complexity. We believe that our experiments provide valuable insight into the performance, advantages, and disadvantages of alternative cloud-native execution models for scientific workflows.

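The key difference between the job-based and worker-pool models is who owns the task loop: instead of launching one job per task (paying startup cost each time), long-lived workers pull tasks from a shared queue. A minimal in-process sketch with threads, which stands in for what the paper realizes with Kubernetes pods:

```python
import queue
import threading

def run_with_worker_pool(tasks, n_workers, handler):
    """Toy worker-pool executor: n_workers long-lived workers drain a
    shared task queue, amortizing startup cost across all tasks."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # queue drained: the worker retires
            r = handler(task)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

In the cloud-native version the queue is an external service and the pool auto-scales, but the utilization argument is the same: idle workers pick up the next task immediately rather than waiting for a new job to be scheduled.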
Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning
Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu
arXiv:2408.14736 · https://doi.org/arxiv-2408.14736 · published 2024-08-27

Abstract: Current data compression methods, such as sparsification in Federated Averaging (FedAvg), effectively enhance the communication efficiency of Federated Learning (FL). However, these methods encounter challenges such as the straggler problem and diminished model performance due to heterogeneous bandwidth and non-IID (not Independently and Identically Distributed) data. To address these issues, we introduce a bandwidth-aware compression framework for FL, aimed at improving communication efficiency while mitigating the problems associated with non-IID data. First, our strategy dynamically adjusts compression ratios according to bandwidth, enabling clients to upload their models at a similar pace and thus exploiting otherwise wasted time to transmit more data. Second, we identify the non-overlapping pattern of the parameters retained after compression, which diminishes client update signals when weights are uniformly averaged. Based on this finding, we propose a parameter mask that adjusts the client-averaging coefficients at the parameter level, thereby more closely approximating the original updates and improving training convergence in heterogeneous environments. Our evaluations reveal that our method significantly boosts model accuracy, with a maximum improvement of 13% over uncompressed FedAvg. Moreover, it achieves a 3.37x speedup in reaching the target accuracy compared to FedAvg with a Top-K compressor, demonstrating its effectiveness in accelerating convergence under compression. The integration of common compression techniques into our framework further establishes its potential as a versatile foundation for future cross-device, communication-efficient FL research, addressing critical challenges in FL and advancing the field of distributed machine learning.

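Both ideas above can be sketched in a few lines of NumPy: Top-K sparsification with a bandwidth-scaled keep ratio, and per-parameter averaging that divides each coordinate by the number of clients that actually retained it rather than by the total client count. The linear bandwidth rule is a hypothetical placeholder, not the paper's exact policy.

```python
import numpy as np

def bandwidth_aware_ratio(bandwidth, max_bandwidth, base_ratio):
    """Hypothetical linear rule: faster clients keep (and send) a larger
    fraction of their update so all uploads finish at a similar pace."""
    return base_ratio * bandwidth / max_bandwidth

def top_k_sparsify(update, ratio):
    """Keep only the largest-magnitude fraction `ratio` of entries,
    returning the sparse update and the mask of retained positions."""
    k = max(1, int(update.size * ratio))
    idx = np.argsort(np.abs(update))[-k:]
    mask = np.zeros(update.shape, dtype=bool)
    mask[idx] = True
    return update * mask, mask

def masked_average(updates, masks):
    """Average each parameter only over the clients that retained it.
    Uniform averaging would divide by the client count and shrink the
    non-overlapping coordinates toward zero."""
    total = np.sum(updates, axis=0)
    counts = np.maximum(np.sum(masks, axis=0), 1)  # avoid divide-by-zero
    return total / counts
```

The `masked_average` denominator is the "parameter mask" intuition: a coordinate kept by only one client is restored at full strength instead of being diluted by clients that dropped it.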
Faster Cycle Detection in the Congested Clique
Keren Censor-Hillel, Tomer Even, Virginia Vassilevska Williams
arXiv:2408.15132 · https://doi.org/arxiv-2408.15132 · published 2024-08-27

Abstract: We provide a fast distributed algorithm for detecting $h$-cycles in the Congested Clique model, whose running time decreases as the number of $h$-cycles in the graph increases. In undirected graphs, constant-round algorithms are known for cycles of even length. Our algorithm greatly improves upon the state of the art for odd values of $h$. Moreover, our running time also applies to directed graphs, in which case the improvement holds for all values of $h$. Further, our techniques allow us to obtain a triangle detection algorithm in the quantum variant of this model that is faster than prior work. A key technical contribution we develop to obtain our fast cycle detection algorithm is a new algorithm for computing the products of many pairs of small matrices in parallel, which may be of independent interest.

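The connection between cycle detection and matrix multiplication is classical: in an undirected graph with adjacency matrix $A$, the trace of $A^3$ counts each triangle six times (three starting vertices, two directions). A centralized NumPy sketch of that fact follows; the paper's contribution is a fast distributed and batched version of such products, not this sequential computation.

```python
import numpy as np

def count_triangles(adj):
    """Count triangles in an undirected graph via matrix powers:
    trace(A^3) = number of closed walks of length 3, which counts
    every triangle exactly 6 times."""
    a = np.asarray(adj, dtype=np.int64)
    return int(np.trace(a @ a @ a) // 6)
```

The same diagonal-of-a-power idea generalizes to longer cycles, which is why fast (distributed) matrix multiplication is the workhorse of $h$-cycle detection.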
Towards observability of scientific applications
Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski
arXiv:2408.15439 · https://doi.org/arxiv-2408.15439 · published 2024-08-27

Abstract: As software systems increase in complexity, conventional monitoring methods struggle to provide a comprehensive overview or identify performance issues, often missing unexpected problems. Observability, however, offers a holistic approach, providing methods and tools that gather and analyze detailed telemetry data to uncover hidden issues. Originally developed for cloud-native systems, modern observability is less prevalent in scientific computing, particularly in HPC clusters, due to differences in application architecture, execution environments, and technology stacks. This paper proposes and evaluates an end-to-end observability solution tailored to scientific computing in HPC environments. We address several challenges, including the collection of application-level metrics, instrumentation, context propagation, and tracing. We argue that typical dashboards with charts are not sufficient for advanced observability-driven analysis of scientific applications. Consequently, we propose a different approach based on data analysis using DataFrames in a Jupyter environment. The proposed solution is implemented and evaluated on two medical scientific pipelines running on an HPC cluster.

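The DataFrame-based analysis style argued for above looks roughly like this: raw trace spans land in a pandas DataFrame, and ad hoc questions ("which pipeline stage dominates latency?") become short groupby expressions rather than pre-built dashboard charts. The column names here are illustrative, not the paper's schema.

```python
import pandas as pd

def slowest_stages(spans: pd.DataFrame, top: int = 3) -> pd.DataFrame:
    """Aggregate raw trace spans (columns: stage, start, end) into
    per-stage latency statistics, sorted by mean duration."""
    stats = (spans.assign(duration=spans["end"] - spans["start"])
                  .groupby("stage")["duration"]
                  .agg(["mean", "max", "count"]))
    return stats.sort_values("mean", ascending=False).head(top)
```

In a Jupyter environment the same DataFrame can immediately be re-sliced by node, by input file, or by run, which is the flexibility a fixed dashboard lacks.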
Partition Detection in Byzantine Networks
Yérom-David Bromberg (IRISA, UR), Jérémie Decouchant (TU Delft), Manon Sourisseau (IRISA, UR), François Taïani (IRISA, UR)
arXiv:2408.14814 · https://doi.org/arxiv-2408.14814 · published 2024-08-27

Abstract: Detecting and handling network partitions is a fundamental requirement of distributed systems. Although existing partition-detection methods in arbitrary graphs tolerate unreliable networks, they either assume that all nodes are correct or that only a limited number of nodes might crash. In particular, Byzantine behaviors are out of the scope of these algorithms, despite Byzantine fault tolerance being an active research topic for important problems such as consensus. Moreover, Byzantine-tolerant protocols, such as broadcast or consensus, always rely on the assumption of connected networks. This paper addresses the problem of detecting partitions in Byzantine networks (without connectivity assumptions). We present a novel algorithm, which we call NECTAR, that safely detects partitioned and possibly partitionable networks, and we prove its correctness. NECTAR allows all correct nodes to detect whether a network could suffer from Byzantine nodes. We evaluate NECTAR's performance and compare it to two existing baselines using up to 100 nodes running real code on various realistic topologies. Our results confirm that NECTAR maintains 100% accuracy, while the accuracy of the existing baselines decreases by at least 40% as soon as one participant is Byzantine. Although NECTAR's network cost increases with the number of nodes and decreases with the network's diameter, it does not exceed roughly 500 KB in the worst cases.

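The property NECTAR detects, "possibly partitionable", can be stated graph-theoretically: a network is at risk if removing some set of at most f (Byzantine) nodes disconnects the remaining correct nodes. The brute-force check below illustrates the property only; it is not NECTAR's distributed algorithm and is viable only for small graphs.

```python
from itertools import combinations

def reachable(adj, start, removed):
    """Nodes reachable from `start` when `removed` nodes are cut out."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def partitionable(adj, f):
    """True if removing some set of at most f nodes disconnects the
    remaining graph. Exponential brute force: illustration only."""
    nodes = list(adj)
    for r in range(f + 1):
        for cut in combinations(nodes, r):
            rest = [n for n in nodes if n not in cut]
            if rest and reachable(adj, rest[0], set(cut)) != set(rest):
                return True
    return False
```

A ring, for instance, survives any single removal but splits when two opposite nodes are cut, so it is safe for f = 1 and partitionable for f = 2.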