arXiv - CS - Distributed, Parallel, and Cluster Computing: Latest Articles

Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-27 | arXiv: 2408.15384
Temitayo Adefemi
Matrix multiplication is integral to various scientific and engineering disciplines, including machine learning, image processing, and gaming. With the increasing data volumes in areas like machine learning, the demand for efficient parallel processing of large matrices has grown significantly. This study explores the performance of both serial and parallel matrix multiplication on the Cirrus supercomputer at the University of Edinburgh. The results demonstrate the scalability and efficiency of these methods, providing insights for optimizing matrix multiplication in real-world applications.
Citations: 0
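The abstract does not spell out the parallel decomposition used on Cirrus; as a rough illustration of one common scheme, the sketch below distributes row blocks of A across MPI ranks (via mpi4py) and broadcasts B, so each rank computes its block of C independently. The matrix size, the row-block layout, and the use of mpi4py are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch: row-block parallel dense matrix multiplication with mpi4py.
# One common decomposition; not necessarily the scheme evaluated in the paper.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1024                                    # illustrative problem size
B = np.empty((n, n))
A_blocks = None
if rank == 0:
    A = np.random.rand(n, n)
    B[:] = np.random.rand(n, n)
    A_blocks = np.array_split(A, size, axis=0)   # one row block per rank

A_local = comm.scatter(A_blocks, root=0)    # each rank receives its rows of A
comm.Bcast(B, root=0)                       # every rank needs all of B

C_local = A_local @ B                       # purely local compute: this rank's rows of C
C_blocks = comm.gather(C_local, root=0)     # root reassembles the full product
if rank == 0:
    C = np.vstack(C_blocks)
    assert np.allclose(C, A @ B)            # sanity check on the root rank
```

Run with, for example, `mpirun -np 4 python matmul_sketch.py`; a study like this one would compare such parallel runs against a serial baseline on a single core.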
A parallel particle cluster algorithm using nearest neighbour graphs and passive target communication
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-27 | arXiv: 2408.15348
Matthias Frey, Steven Böing, Rui F. G. Apóstolo
We present a parallel cluster algorithm for $N$-body simulations which uses a nearest neighbour search algorithm and one-sided message passing interface (MPI) communication. The nearest neighbour is defined by the Euclidean distance in three-dimensional space. The resulting directed nearest neighbour graphs that are used to define the clusters are split up in an iterative procedure with MPI remote memory access (RMA) communication. The method has been implemented as part of the elliptical parcel-in-cell (EPIC) method targeting geophysical fluid flows. The parallel scalability of the algorithm is discussed by means of an artificial and a standard fluid dynamics test case. The cluster algorithm shows good weak and strong scalability up to 16,384 cores with a parallel weak scaling efficiency of about 80% for balanced workloads. In poorly balanced problems, MPI synchronisation dominates execution of the cluster algorithm and thus drastically worsens its parallel scalability.
Citations: 0
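The paper's contribution is the parallel, RMA-based splitting of the nearest-neighbour graphs; as background only, a serial sketch of the underlying construct (a directed nearest-neighbour graph under the 3-D Euclidean metric, with clusters taken as its weakly connected components) might look like the following. The use of SciPy's k-d tree and sparse-graph routines is an assumption for illustration.

```python
# Serial sketch (illustration only): build the directed nearest-neighbour graph
# of a set of 3-D points and label its weakly connected components as clusters.
# The paper splits such graphs iteratively and in parallel with MPI RMA.
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

points = np.random.rand(10_000, 3)             # e.g. parcel positions in 3-D space
tree = cKDTree(points)
_, idx = tree.query(points, k=2)                # k=2: self plus the true nearest neighbour
nearest = idx[:, 1]                             # directed edge i -> nearest[i]

n = len(points)
graph = coo_matrix((np.ones(n), (np.arange(n), nearest)), shape=(n, n))
n_clusters, labels = connected_components(graph, directed=True, connection='weak')
print(f"{n_clusters} clusters among {n} points")
```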
A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-26 | arXiv: 2408.14558
Yuxi Hong, Aydin Buluc
Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication. Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
Citations: 0
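A serial SciPy sketch of the sparsity-aware idea: with A partitioned by rows, a process only needs the rows of B whose indices appear as nonzero column indices in its local rows of A. The partition size and data layout below are illustrative; the paper's algorithm performs the fetching with MPI RDMA and block fetching rather than locally.

```python
# Illustration of the sparsity-aware 1D idea for C = A @ B (SpGEMM): fetch only
# the rows of B that the local nonzeros of A actually touch. Serial stand-in;
# the distributed algorithm fetches these rows from remote ranks via MPI RDMA.
import numpy as np
import scipy.sparse as sp

A = sp.random(1000, 1000, density=0.001, format='csr', random_state=0)
B = sp.random(1000, 1000, density=0.001, format='csr', random_state=1)

A_local = A[0:250]                             # rows owned by one (hypothetical) process
needed = np.unique(A_local.indices)            # nonzero column indices -> rows of B to fetch
B_fetched = B[needed]                          # only these rows would cross the network

# Remap A_local's column indices onto the compacted set of fetched rows of B.
col_map = {c: i for i, c in enumerate(needed)}
new_indices = np.array([col_map[c] for c in A_local.indices],
                       dtype=A_local.indices.dtype)
A_compact = sp.csr_matrix((A_local.data, new_indices, A_local.indptr),
                          shape=(A_local.shape[0], len(needed)))

C_local = A_compact @ B_fetched                # this process's rows of the product
assert np.allclose(C_local.toarray(), (A_local @ B).toarray())
```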
Scalable, reproducible, and cost-effective processing of large-scale medical imaging datasets
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-26 | arXiv: 2408.14611
Michael E. Kim, Karthik Ramadass, Chenyu Gao, Praitayini Kanakaraj, Nancy R. Newlin, Gaurav Rudravaram, Kurt G. Schilling, Blake E. Dewey, Derek Archer, Timothy J. Hohman, Zhiyuan Li, Shunxing Bao, Bennett A. Landman, Nazirah Mohd Khairi
Curating, processing, and combining large-scale medical imaging datasets from national studies is a non-trivial task due to the intense computation and data throughput required, variability of acquired data, and associated financial overhead. Existing platforms or tools for large-scale data curation, processing, and storage have difficulty achieving a viable cost-to-scale ratio of computation speed for research purposes, being either too slow or too expensive. Additionally, managing and keeping consistent the processing of large data in a team-driven manner is a non-trivial task. We design a BIDS-compliant method for an efficient and robust data processing pipeline of large-scale diffusion-weighted and T1-weighted MRI data compatible with low-cost, high-efficiency computing systems. Our method accomplishes automated querying of the data available for processing and runs processes in a consistent and reproducible manner with long-term stability, while using heterogeneous low-cost computational resources and storage systems for efficient processing and data transfer. We demonstrate how our organizational structure permits efficiency in a semi-automated data processing pipeline and show how our method is comparable in processing time to cloud-based computation while being almost 20 times more cost-effective. Our design allows for fast data throughput speeds and low latency to reduce the time for data transfer between storage servers and computation servers, achieving an average of 0.60 Gb/s compared to 0.33 Gb/s for cloud-based processing methods. The design of our workflow engine permits processes to start quickly while maintaining the flexibility to adapt to newly acquired data.
Citations: 0
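As a purely illustrative sketch of the "query what is available but not yet processed" step that such a pipeline automates, the snippet below scans a BIDS-style directory for T1-weighted scans lacking a completion marker in a derivatives tree. The paths, the DONE-marker convention, and the session-less layout are hypothetical; they are not taken from the paper.

```python
# Illustrative only: find T1w scans in a BIDS-style tree that have no completion
# marker under a derivatives directory. Paths and the DONE-marker convention are
# hypothetical stand-ins for the paper's automated availability query.
from pathlib import Path

bids_root = Path("/data/bids_dataset")            # hypothetical raw BIDS dataset
deriv_root = Path("/data/derivatives/pipeline")   # hypothetical derivatives tree

pending = []
for t1w in sorted(bids_root.glob("sub-*/anat/*_T1w.nii.gz")):
    subject = t1w.relative_to(bids_root).parts[0]  # e.g. "sub-0001"
    if not (deriv_root / subject / "DONE").exists():
        pending.append((subject, t1w))

print(f"{len(pending)} T1w scans queued for processing")
```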
Resource Efficient Asynchronous Federated Learning for Digital Twin Empowered IoT Network
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-26 | arXiv: 2408.14298
Shunfeng Chu, Jun Li, Jianxin Wang, Yiyang Ni, Kang Wei, Wen Chen, Shi Jin
As an emerging technology, a digital twin (DT) can provide real-time status and dynamic topology mapping for Internet of Things (IoT) devices. However, DT and its implementation within industrial IoT networks necessitate substantial, distributed data support, which often leads to "data silos" and raises privacy concerns. To address these issues, we develop a dynamic resource scheduling algorithm tailored for the asynchronous federated learning (FL)-based lightweight DT-empowered IoT network. Specifically, our approach aims to minimize a multi-objective function that encompasses both energy consumption and latency by optimizing IoT device selection and transmit power control, subject to FL model performance constraints. We utilize the Lyapunov method to decouple the formulated problem into a series of one-slot optimization problems and develop a two-stage optimization algorithm to achieve the optimal transmission power control and IoT device scheduling strategies. In the first stage, we derive closed-form solutions for the optimal transmit power on the IoT device side. In the second stage, since partial state information is unknown, e.g., the transmit power and computational frequency of IoT devices, the edge server employs a multi-armed bandit (MAB) framework to model the IoT device selection problem and utilizes an efficient online algorithm, namely the client utility-based upper confidence bound (CU-UCB), to address it. Numerical results validate our algorithm's superiority over benchmark schemes, and simulations demonstrate that our algorithm achieves faster training speeds on the Fashion-MNIST and CIFAR-10 datasets within the same training duration.
Citations: 0
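The paper's CU-UCB scheduler uses a client-utility-based index whose exact form is not given in the abstract; the sketch below shows only the generic multi-armed-bandit structure (UCB1-style selection of k devices per round with incremental mean updates), with a placeholder utility, and should not be read as the paper's algorithm.

```python
# Generic UCB1-style device selection, as a structural stand-in for CU-UCB:
# pick the k devices with the highest "mean utility + exploration bonus",
# observe their utilities, and update the running means. The utility here is a
# random placeholder, not the paper's energy/latency-based client utility.
import math
import random

n_devices, n_rounds, k_select = 20, 200, 5
counts = [0] * n_devices                 # how often each device has been selected
means = [0.0] * n_devices                # running mean of observed utility

def observed_utility(i):
    return random.random()               # placeholder for the measured client utility

for t in range(1, n_rounds + 1):
    ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i]) if counts[i] else float('inf')
           for i in range(n_devices)]
    selected = sorted(range(n_devices), key=lambda i: ucb[i], reverse=True)[:k_select]
    for i in selected:
        r = observed_utility(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
```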
Employing Artificial Intelligence to Steer Exascale Workflows with Colmena
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-26 | arXiv: 2408.14434
Logan Ward, J. Gregory Pauloski, Valerie Hayot-Sasson, Yadu Babuji, Alexander Brace, Ryan Chard, Kyle Chard, Rajeev Thakur, Ian Foster
Computational workflows are a common class of application on supercomputers, yet the loosely coupled and heterogeneous nature of workflows means they often fail to take full advantage of a supercomputer's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce the communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.
Citations: 0
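The sketch below mimics the cooperative-agent pattern described above in plain Python: one agent keeps tasks flowing, a worker thread stands in for remote execution on compute nodes, and a result-processing agent reacts to completions and steers toward better candidates. It illustrates the idea only and is not Colmena's actual API.

```python
# Plain-Python illustration of the cooperative-agent steering pattern: a
# submitter agent feeds tasks, a worker thread stands in for remote execution,
# and a result-processor agent reacts to completions. NOT Colmena's API.
import queue
import random
import threading
import time

N_TASKS = 50
task_queue, result_queue = queue.Queue(), queue.Queue()
best = {"score": float("-inf"), "x": None}

def submitter():
    """Agent: keep the (simulated) machine supplied with candidate tasks."""
    for _ in range(N_TASKS):
        task_queue.put(random.random())

def worker():
    """Stand-in for task execution on remote compute nodes."""
    while True:
        x = task_queue.get()
        time.sleep(0.01)                         # pretend to compute
        result_queue.put((x, -(x - 0.7) ** 2))   # score the candidate

def result_processor():
    """Agent: react to each completed task and update the steering state."""
    for _ in range(N_TASKS):
        x, score = result_queue.get()
        if score > best["score"]:
            best.update(score=score, x=x)

threading.Thread(target=worker, daemon=True).start()
threading.Thread(target=submitter, daemon=True).start()
result_processor()
print("best candidate so far:", best)
```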
Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-09 | arXiv: 2408.05152
Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton
Matrix computations are a fundamental building block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations involve allocating coded combinations of submatrices to worker nodes, to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches will compromise sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we find a lower bound on the weight of the coding, i.e., the number of submatrices to be combined to obtain coded submatrices, required to provide resilience to the maximum possible number of straggler devices (for a given number of devices and their storage constraints). Next, we propose distributed matrix computation schemes which meet this exact lower bound on the coding weight. Numerical experiments conducted on Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
Citations: 0
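The sketch below illustrates the trade-off the paper formalizes: each coded submatrix is a random linear combination of w row blocks of A (w is the coding "weight"), which buys straggler resilience but inflates the density of a sparse A roughly w-fold. Block counts, the random code, and the density measurement are illustrative; the paper derives the minimum achievable weight and schemes that attain it.

```python
# Illustration of why coding weight matters for sparse inputs: a coded submatrix
# combining w row blocks of a sparse A is roughly w times denser than a single
# block. Block sizes and the random code below are illustrative only.
import numpy as np
import scipy.sparse as sp

k, n_workers, w = 4, 6, 2                      # k row blocks, 6 workers, weight-2 coding
rows_per_block = 1000
A = sp.random(k * rows_per_block, 1000, density=0.005, format='csr', random_state=0)
blocks = [A[i * rows_per_block:(i + 1) * rows_per_block] for i in range(k)]

rng = np.random.default_rng(0)
coded = []
for _ in range(n_workers):
    picks = rng.choice(k, size=w, replace=False)   # which blocks this worker's code combines
    coeffs = rng.standard_normal(w)
    block = coeffs[0] * blocks[picks[0]]
    for c, j in zip(coeffs[1:], picks[1:]):
        block = block + c * blocks[j]
    coded.append(block)

density = lambda m: m.nnz / (m.shape[0] * m.shape[1])
print("uncoded block density:", density(blocks[0]))
print("coded block density  :", density(coded[0]))   # roughly w times larger
```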
Distributed Augmentation, Hypersweeps, and Branch Decomposition of Contour Trees for Scientific Exploration
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-09 | arXiv: 2408.04836
Mingzhe Li, Hamish Carr, Oliver Rübel, Bei Wang, Gunther H. Weber
Contour trees describe the topology of level sets in scalar fields and are widely used in topological data analysis and visualization. A main challenge of utilizing contour trees for large-scale scientific data is computing them at scale on high-performance computing systems. To address this challenge, recent work has introduced distributed hierarchical contour trees for distributed computation and storage of contour trees. However, effective use of these distributed structures in analysis and visualization requires subsequent computation of geometric properties and branch decomposition to support contour extraction and exploration. In this work, we introduce distributed algorithms for augmentation, hypersweeps, and branch decomposition that enable parallel computation of geometric properties and support the use of distributed contour trees as query structures for scientific exploration. We evaluate the parallel performance of these algorithms and apply them to identify and extract important contours for scientific visualization.
Citations: 0
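As background for what these distributed structures represent, the sketch below computes a sequential join tree of a 1-D scalar field with union-find: vertices are swept from high to low value, and an arc is recorded whenever two superlevel-set components meet. This is the classic serial construction, not the paper's distributed augmentation, hypersweep, or branch-decomposition algorithms.

```python
# Sequential union-find sketch of a join tree for a 1-D scalar field: sweep
# vertices from high to low value and record where superlevel-set components
# join. Background only; the paper's algorithms operate on distributed
# hierarchical contour trees, not this serial construction.
import numpy as np

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]          # path compression
        x = parent[x]
    return x

def join_tree_1d(values):
    n = len(values)
    parent = list(range(n))
    processed = [False] * n
    arcs = []                                  # (representative of absorbed component, join vertex)
    for v in sorted(range(n), key=lambda i: values[i], reverse=True):
        processed[v] = True
        for u in (v - 1, v + 1):               # 1-D neighbourhood
            if 0 <= u < n and processed[u]:
                ru, rv = find(parent, u), find(parent, v)
                if ru != rv:
                    arcs.append((ru, v))       # components meet at vertex v
                    parent[ru] = rv
    return arcs

field = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.0])
print(join_tree_1d(field))                     # union events of the fully augmented join tree
```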
Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-09 | arXiv: 2408.04808
Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang
As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). This allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3× performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.
Citations: 0
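The compute-shift idea can be illustrated with a small NumPy simulation: each simulated core keeps a fixed row block of A and a rotating column block of B; in each step it multiplies what it holds, then hands its B block to the next core over the ring of inter-core links. The core count, matrix sizes, and ring schedule are illustrative assumptions, not T10's actual plan selection.

```python
# NumPy simulation of a ring compute-shift pattern: core p keeps row block p of
# A, and the column blocks of B rotate around the ring; after P steps every core
# has produced its full block of rows of C. Illustrative only; T10 searches a
# much larger space of such partitionings and shift schedules.
import numpy as np

P, n = 4, 8                                        # 4 simulated cores, 8x8 matrices
A, B = np.random.rand(n, n), np.random.rand(n, n)

A_shards = np.split(A, P, axis=0)                  # fixed: core p owns row block p of A
B_shards = np.split(B, P, axis=1)                  # rotating: column blocks of B
C_shards = [np.zeros((n // P, n)) for _ in range(P)]

held = list(range(P))                              # held[p] = B block currently on core p
for step in range(P):
    for p in range(P):                             # on hardware, all cores do this in parallel
        j = held[p]
        C_shards[p][:, j * (n // P):(j + 1) * (n // P)] = A_shards[p] @ B_shards[j]
    held = [held[(p + 1) % P] for p in range(P)]   # shift: pass each B block to the next core

assert np.allclose(np.vstack(C_shards), A @ B)     # the shards assemble into the full product
```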
Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2024-08-08 | arXiv: 2408.04307
Weilin Cai, Le Qin, Jiayi Huang
As large language models continue to scale up, the imperative for fault tolerance in distributed deep learning systems intensifies, becoming a focal area of AI infrastructure research. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges for traditional checkpoint techniques due to the substantial increase in model size, despite computational demands comparable to dense models. Breaking new ground in efficient fault tolerance for MoE model training, we introduce a novel Partial Experts Checkpoint (PEC) mechanism alongside a corresponding PEC fault-tolerant system. Our approach strategically checkpoints a selected subset of experts, thereby significantly reducing the checkpoint size for MoE models to a level comparable with that of dense models. The empirical analysis on our 8-expert GPT-MoE model demonstrates that the proposed PEC approach facilitates a substantial 54.2% decrease in the size of the non-redundant checkpoint (no data-parallel duplication), without compromising the final model quality. Moreover, our PEC fault-tolerant system achieves a 76.9% reduction in checkpoint workload per data-parallel distributed rank, thereby correspondingly diminishing the checkpointing time and facilitating complete overlap with the training process.
Citations: 0
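A minimal PyTorch-flavoured sketch of the partial-experts idea: every checkpoint stores the shared (non-expert) parameters in full but only a rotating subset of experts, keeping per-checkpoint size close to that of a dense model. The module names, expert count, and round-robin selection policy are illustrative assumptions, not the paper's exact mechanism.

```python
# Sketch of a partial-experts checkpoint: always save the shared parameters,
# but only `experts_per_ckpt` experts per checkpoint, rotating the subset so all
# experts are covered over successive checkpoints. Illustrative policy only.
import torch
import torch.nn as nn

n_experts, experts_per_ckpt, d = 8, 2, 512
experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
backbone = nn.Linear(d, d)                       # stands in for attention / shared layers

def partial_experts_checkpoint(step, path):
    start = (step * experts_per_ckpt) % n_experts
    chosen = [(start + i) % n_experts for i in range(experts_per_ckpt)]
    torch.save({
        "step": step,
        "backbone": backbone.state_dict(),                        # always saved in full
        "experts": {i: experts[i].state_dict() for i in chosen},  # only the chosen subset
    }, path)

for step in range(4):                            # after 4 checkpoints all 8 experts are covered
    partial_experts_checkpoint(step, f"ckpt_{step:04d}.pt")
```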