2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS): latest papers, page 4

Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00024

Aditya Kashi, Pratik Nayak, Dhruva Kulkarni, A. Scheinberg, Paul Lin, H. Anzt

Abstract: Batched linear solvers, which solve many small related but independent problems, are important in several applications. This is increasingly the case for highly parallel processors such as graphics processing units (GPUs), which need a substantial amount of work to operate efficiently, so solving small problems one by one is not an option. Because each problem is small, devising a parallel partitioning scheme and mapping the problem to hardware is not trivial. Recently, significant attention has been given to batched dense linear algebra. However, there is also interest in using sparse iterative solvers in batched form, which presents further challenges. An example use case is found in a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created to facilitate optimizations and porting to GPUs. The current collision kernel linear solver does not run on the GPU, which is a major bottleneck. As these matrices are well-conditioned, batched iterative sparse solvers are an attractive option. A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how the software architecture can be used to develop an efficient solution for the XGC collision proxy app. Solve times on NVIDIA V100 and A100 GPUs and AMD MI100 GPUs are compared with those on one dual-socket Intel Xeon Skylake CPU node with 40 OpenMP threads, for matrices representative of those required in the collision kernel of XGC. The results suggest that Ginkgo's batched sparse iterative solvers are well suited for efficient utilization of the GPU for this problem, and that the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution on exascale-oriented heterogeneous architectures at the various leadership supercomputing facilities.
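To make the idea of a batched iterative solve concrete, here is a minimal pure-Python sketch. It assumes small dense, strictly diagonally dominant systems and uses a plain Jacobi iteration. This is not Ginkgo's API and not necessarily the solver evaluated in the paper; the names `jacobi_solve` and `batched_jacobi` are hypothetical. The sketch only illustrates that every system in the batch is independent and stops at its own iteration count.

```python
def jacobi_solve(a, b, tol=1e-10, max_iters=500):
    """Solve one small system a x = b with Jacobi iteration.

    Assumes `a` is strictly diagonally dominant so the iteration converges.
    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iters + 1):
        # Each new entry uses only values from the previous iterate.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Stop once the largest update falls below the tolerance.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, iteration
        x = x_new
    return x, max_iters


def batched_jacobi(batch_a, batch_b, tol=1e-10, max_iters=500):
    """Solve a batch of independent systems, one convergence check per system."""
    return [
        jacobi_solve(a, b, tol=tol, max_iters=max_iters)
        for a, b in zip(batch_a, batch_b)
    ]


# Hypothetical 2x2 example batch (illustrative data only).
batch_a = [[[4.0, 1.0], [1.0, 3.0]], [[10.0, 2.0], [3.0, 8.0]]]
batch_b = [[1.0, 2.0], [12.0, 11.0]]
solutions = batched_jacobi(batch_a, batch_b)
```

In this sketch the loop over systems runs sequentially; on a GPU, the independent systems would presumably be distributed across parallel work groups instead.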

Citations: 6

Resource Utilization Aware Job Scheduling to Mitigate Performance Variability

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00040

Daniel Nichols, Aniruddha Marathe, Kathleen Shoga, T. Gamblin, A. Bhatele

Abstract: Resource contention on high performance computing (HPC) platforms can lead to significant variation in application performance. When several jobs experience such large variations in run times, it can lead to less efficient use of system resources. It can also lead to users over-estimating their job's expected run time, which degrades the efficiency of the system scheduler. Mitigating performance variation on HPC platforms benefits end users and also enables more efficient use of system resources. In this paper, we present a pipeline for collecting and analyzing system and application performance data for jobs submitted over long periods of time. We use a set of machine learning (ML) models trained on this data to classify performance variation using current system counters. Additionally, we present a new resource-aware job scheduling algorithm that utilizes the ML pipeline and current system state to mitigate job variation. We evaluate our pipeline, ML models, and scheduler using various proxy applications and an actual implementation of the scheduler on an InfiniBand-based fat-tree cluster.

Citations: 2

As easy as ABC: Optimal (A)ccountable (B)yzantine (C)onsensus is easy!

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00061

Pierre Civit, Seth Gilbert, V. Gramoli, R. Guerraoui, Jovan Komatovic

Abstract: It is known that the agreement property of the Byzantine consensus problem among $n$ processes can be violated in a non-synchronous system if the number of faulty processes exceeds $t_{0} = \lceil n/3 \rceil - 1$ [10], [19]. In this paper, we investigate the accountable Byzantine consensus problem in non-synchronous systems: the problem of solving Byzantine consensus whenever possible (e.g., when the number of faulty processes does not exceed $t_{0}$) and allowing correct processes to obtain proof of culpability of (at least) $t_{0} + 1$ faulty processes whenever correct processes disagree. We present four complementary contributions: 1) We introduce ABC: a simple yet efficient transformation of any Byzantine consensus protocol to an accountable one. ABC introduces an overhead of only two all-to-all communication rounds and $O(n^{2})$ additional bits in executions with up to $t_{0}$ faults (i.e., in the common case). 2) We define the accountability complexity, a complexity metric representing the number of accountability-specific messages that correct processes must send. Furthermore, we prove a tight lower bound. In particular, we show that any accountable Byzantine consensus protocol incurs cubic accountability complexity. Moreover, we illustrate that the bound is tight by applying the ABC transformation to any Byzantine consensus protocol. 3) We demonstrate that, when applied to an optimal Byzantine consensus protocol, ABC constructs an accountable Byzantine consensus protocol that is (1) optimal with respect to the communication complexity in solving consensus whenever consensus is solvable, and (2) optimal with respect to the accountability complexity in obtaining accountability whenever disagreement occurs. 4) We generalize ABC to other distributed computing problems besides the classic consensus problem. We characterize a class of agreement tasks, including reliable and consistent broadcast [5], that ABC renders accountable.
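The fault threshold quoted in the abstract, t_0 = ceil(n/3) - 1, is easy to evaluate for concrete system sizes. The following sketch is only an illustration of that arithmetic; the function names are hypothetical and nothing here implements the ABC transformation itself.

```python
def max_tolerated_faults(n: int) -> int:
    """Return t0 = ceil(n / 3) - 1 for a system of n processes."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Integer form of the ceiling: ceil(n / 3) == (n + 2) // 3.
    return (n + 2) // 3 - 1


def within_fault_threshold(n: int, faulty: int) -> bool:
    """True when the number of faulty processes does not exceed t0."""
    return faulty <= max_tolerated_faults(n)


def min_culprits_on_disagreement(n: int) -> int:
    """Number of faulty processes (at least t0 + 1) exposed when correct processes disagree."""
    return max_tolerated_faults(n) + 1
```

For example, with n = 4 the threshold is t_0 = 1, so a single faulty process stays within the threshold, and any disagreement among correct processes implicates at least 2 faulty processes.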

Citations: 7

SFP: Service Function Chain Provision on Programmable Switches for Cloud Tenants

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00123

Hongyi Huang, Wenfei Wu, Yongchao He, Zehua Guo

Citations: 2

MLCNN: Cross-Layer Cooperative Optimization and Accelerator Architecture for Speeding Up Deep Learning Applications

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00118

Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Song Fu, Qing Yang, Ming-Qing Liu

Abstract: The ever-increasing number of layers, millions of parameters, and large data volume make deep learning workloads resource-intensive and power-hungry. In this paper, we develop a convolutional neural network (CNN) acceleration framework, named MLCNN, which explores algorithm-hardware co-design to achieve cross-layer cooperative optimization and acceleration. MLCNN dramatically reduces computation and on-off chip communication, improving CNN's performance. To achieve this, MLCNN reorders the position of nonlinear activation layers and pooling layers, which we prove results in a negligible accuracy loss; then the convolutional layer and pooling layer are co-optimized by means of redundant multiplication elimination, local addition reuse, and global addition reuse. To the best of our knowledge, MLCNN is the first of its kind that incorporates cooperative optimization across convolutional, activation, and pooling layers. We further customize the MLCNN accelerator to take full advantage of cross-layer CNN optimization to reduce both computation and on-off chip communication. Our analysis shows that MLCNN can significantly reduce (up to 98%) multiplications and additions. We have implemented a prototype of MLCNN and evaluated its performance on several widely used CNN models using both an accelerator-level cycle and energy model and RTL implementation. Experimental results show that MLCNN achieves 3.2x speedup and 2.9x energy efficiency compared with dense CNNs. MLCNN's optimization methods are orthogonal to other CNN acceleration techniques, such as quantization and pruning. Combined with quantization, our quantized MLCNN gains a 12.8x speedup and 11.3x energy efficiency compared with DCNN.
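The abstract does not say which activation and pooling functions are reordered, so the following is only a sketch under assumed choices: ReLU as the activation, with max pooling and average pooling as two candidate pooling functions. All names are hypothetical. It shows why swapping the two layers can be exact in one case (a monotone activation commutes with max) and only approximate in another, while pooling first applies the activation to one value per window instead of every element.

```python
def relu(x: float) -> float:
    """Rectified linear unit, a monotone non-decreasing activation."""
    return x if x > 0.0 else 0.0


def max_pool(window):
    """Max pooling over one window of values."""
    return max(window)


def avg_pool(window):
    """Average pooling over one window of values."""
    return sum(window) / len(window)


def activate_then_pool(window, pool):
    """Original order: one activation per element, then pooling."""
    return pool([relu(v) for v in window])


def pool_then_activate(window, pool):
    """Reordered: pooling first, then a single activation per window."""
    return relu(pool(window))


# Hypothetical 2x2 pooling window, flattened (illustrative data only).
window = [-2.0, 0.5, 3.0, -1.0]
same_for_max = activate_then_pool(window, max_pool) == pool_then_activate(window, max_pool)
same_for_avg = activate_then_pool(window, avg_pool) == pool_then_activate(window, avg_pool)
```

With this window, the two orders agree for max pooling and differ for average pooling, which is the kind of case where a reordering would change the output and its effect on accuracy has to be argued separately.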

Citations: 1

ParaTreeT: A Fast, General Framework for Spatial Tree Traversal

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00079

Joseph Hutter, J. Szaday, Jaemin Choi, Simeng Liu, L. Kalé, S. Wallace, Thomas R. Quinn

Abstract: Tree-based algorithms for spatial domain applications scale poorly in the distributed setting without extensive experimentation and optimization. Reusability via well-designed parallel abstractions supported by efficient parallel algorithms is therefore desirable. We present ParaTreeT, a parallel tree toolkit for state-of-the-art performance and programmer productivity. ParaTreeT leverages a novel shared-memory software cache to reduce communication volume and idle time throughout traversal. By dividing particles and subtrees across processors independently, it improves decomposition and limits synchronization during tree build. Tree-node states are extracted from the particle set with the Data abstraction, and traversal work and pruning are defined by the Visitor abstraction. ParaTreeT provides built-in trees, decompositions, and traversals that offer application-specific customization. We demonstrate ParaTreeT's improved computational performance over even specialized codes with multiple applications on CPUs. We evaluate how several applications derive benefit from ParaTreeT's models while providing new insights to these workloads through experimentation.

Citations: 1

A Quantitative Study of the Spatiotemporal I/O Burstiness of HPC Application

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00133

Wenxiang Yang, Xiangke Liao, Dezun Dong, Jie Yu

Abstract: Understanding the I/O characteristics of applications on supercomputers is crucial to paving the path for application optimization and system resource allocation. We collect and analyze I/O traces of applications on a production supercomputer and reconfirm that I/O bursts exist in most applications. Moreover, we find that the I/O bursts not only occur in short periods of time but also originate from a minority of adjacent compute nodes allocated to the applications, which we call spatiotemporal I/O burstiness. The concentration of I/O traffic in both the time and space dimensions causes applications to experience poor I/O performance and makes the storage system I/O-inefficient. Although some solutions, such as burst buffers, can help alleviate such inefficiency, there is still no work that measures, analyzes, and further predicts application I/O characteristics in terms of spatiotemporal burstiness, which we think is vital for application-aware optimizations, including but not limited to burst buffer allocation and job scheduling. In this paper, we first propose a mathematical model to measure the spatiotemporal I/O burstiness. We then present a thorough analysis of the spatiotemporal I/O characteristics of all applications on the system. We further make use of the job's submitting path to explore the I/O characteristic similarity among jobs, based on which a machine learning classification algorithm is proposed to accurately predict the job's spatiotemporal I/O burstiness in advance. With accurate job I/O characteristics at hand, some useful suggestions are put forward to hedge the impacts of the spatiotemporal I/O burstiness.

Citations: 3

Scaling and Selecting GPU Methods for All Pairs Shortest Paths (APSP) Computations

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00027

Yang Xia, Peng Jiang, G. Agrawal, R. Ramnath

Citations: 1

Traffic-Optimal Virtual Network Function Placement and Migration in Dynamic Cloud Data Centers

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00094

Vincent Tran, Jingsong Sun, Bin Tang, Deng Pan

Abstract: We propose a new algorithmic framework for traffic-optimal virtual network function (VNF) placement and migration for policy-preserving data centers (PPDCs). As dynamic virtual machine (VM) traffic must traverse a sequence of VNFs in PPDCs, it generates more network traffic, consumes more bandwidth, and causes longer traffic delays than in a traditional data center. We design optimal, approximation, and heuristic traffic-aware VNF placement and migration algorithms to minimize the total network traffic in the PPDC. In particular, we propose the first traffic-aware constant-factor approximation algorithm for VNF placement, a Pareto-optimal solution for VNF migration, and a suite of efficient dynamic-programming (DP)-based heuristics that further improve the approximation solution. At the core of our framework are two new graph-theoretical problems that have not been studied. Using flow characteristics found in production data centers and realistic traffic patterns, we show that a) our VNF migration techniques are effective in mitigating dynamic traffic in PPDCs, reducing the total traffic cost by up to 73%, b) our VNF placement algorithms yield traffic costs 56% to 64% smaller than those of existing techniques, and c) our VNF migration algorithms outperform the state-of-the-art VM migration algorithms by up to 63% in reducing dynamic network traffic.

Citations: 3

Optimal Arbitrary Pattern Formation on a Grid by Asynchronous Autonomous Robots

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2022-05-01 DOI: 10.1109/ipdps53621.2022.00115

Rory Hector, Gokarna Sharma, R. Vaidyanathan, J. Trahan

Abstract: We consider the distributed setting of $N$ autonomous mobile robots that operate in Look-Compute-Move (LCM) cycles following either the robots with lights model or the classical oblivious robots model. For the lights model, we assume obstructed visibility so that a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them. In contrast, we assume unobstructed visibility in the classical model so that a robot sees all others irrespective of their positions. In addition, we consider a grid-based terrain embedded in the 2-dimensional Euclidean plane that restricts each robot's movement to one of the four neighboring grid points from its current position. This grid setting is a natural discretization of the 2-dimensional real plane and extends the robot swarm model in directions of greater applicability. The Arbitrary Pattern Formation problem is to relocate the $N$ robots (starting at arbitrary but distinct initial positions on a grid) to form an arbitrary target pattern given as input. In this paper, we provide two asynchronous algorithms for Arbitrary Pattern Formation, one on the lights model and another on the classical model. Key measures of the algorithms' performance include the time taken and the number of moves by each robot. Both algorithms run in $O(\max\{D^{i}, D^{p}\})$ time with $O(\max\{D^{i}, D^{p}\})$ moves by each robot, where $D^{i}$ and $D^{p}$, respectively, are the diameters of the initial and pattern configurations. The algorithm for the lights model uses $O(1)$ colors. We also prove a lower bound of $\Omega(\max\{D^{i}, D^{p}\})$ for time for any Arbitrary Pattern Formation algorithm if scaling is not allowed on the target pattern. Therefore, our algorithms are optimal w.r.t. time. Furthermore, our algorithms are also optimal w.r.t. the number of moves given the existing lower bound of $\Omega(\max\{D^{i}, D^{p}\})$ on the number of moves. In sum, our results show that having lights provides a trade-off on the unobstructed visibility requirement in the classical model for Arbitrary Pattern Formation.
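The grid restriction described in the abstract (each move goes to one of the four neighboring grid points) can be sketched in a few lines. The abstract does not define how the diameter of a configuration is measured, so the `diameter` function below assumes the largest pairwise Manhattan distance; that choice, and all function names, are illustrative assumptions rather than the paper's definitions.

```python
from itertools import combinations

# The four axis-aligned unit moves allowed on the grid.
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def neighbors(point):
    """Grid points reachable from `point` in a single move."""
    x, y = point
    return [(x + dx, y + dy) for dx, dy in MOVES]


def manhattan(p, q):
    """Minimum number of four-neighbor moves needed to go from p to q."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def diameter(config):
    """Largest pairwise Manhattan distance in a configuration (assumed definition)."""
    return max((manhattan(p, q) for p, q in combinations(config, 2)), default=0)


# Hypothetical initial and target configurations (illustrative data only).
initial = [(0, 0), (2, 0), (1, 3)]
pattern = [(0, 0), (0, 1), (0, 2)]
bound_parameter = max(diameter(initial), diameter(pattern))
```

Under that assumed definition, `bound_parameter` is the quantity max{D^i, D^p} that appears in the time and move bounds quoted in the abstract.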

Citations: 5