2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

An Integral-equation-oriented Vectorized SpMV Algorithm and its Application on CT Imaging Reconstruction
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00080
Weicai Ye, Chenghuan Huang, Jiasheng Huang, Jiajun Li, Yao Lu, Ying Jiang
Abstract: Sparse matrix-vector multiplication (SpMV) is a core routine in many applications. Its performance is limited by the memory bandwidth consumed transferring the matrix between processors and memory, and by instruction latency in computation. Vectorized (SIMD) operations can dramatically improve execution efficiency, but the sparsity patterns of irregular matrices are not compatible with the SIMD execution style. We present a new matrix format, Compressed Sparse Column Vector (CSCV), and a corresponding vectorized SpMV algorithm for matrices arising from integral equations. The algorithm is inherently suited to wide SIMD instructions and reduces the memory bandwidth used. We implement it for Computed Tomography (CT) imaging reconstruction on both Intel and AMD x86 platforms and compare it with seven state-of-the-art SpMV implementations on different CT imaging matrices. Experimental results show that CSCV achieves up to 96.9 GFLOP/s in single-precision tests, a speedup of 3.70× over MKL and 3.48× over the second-place implementation. Furthermore, the CSCV SpMV implementation is performance portable: it contains almost no SIMD assembly code and achieves promising performance with compiler-assisted vectorization.
Code availability: https://github.com/sysu-compsci/cscv
Citations: 0
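For context, the baseline that vectorized formats like CSCV compete against is the standard CSR (Compressed Sparse Row) SpMV kernel. The sketch below shows that baseline in NumPy; CSCV's actual column-vector layout is described in the paper and is not reproduced here.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Baseline CSR sparse matrix-vector product y = A @ x."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=values.dtype)
    for i in range(n_rows):
        # Nonzeros of row i live in values[start:end]
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form
values = np.array([1.0, 2.0, 3.0])
col_idx = np.array([0, 2, 1])
row_ptr = np.array([0, 2, 3])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))  # [3. 3.]
```

The irregular, per-row gather `x[col_idx[start:end]]` is exactly what frustrates SIMD execution and motivates alternative formats.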
Archpipe: Fast and Flexible Pipelined Erasure-coded Archival Scheme for Heterogeneous Networks
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00132
Bin Xu, Jianzhong Huang, X. Qin, Q. Cao, Yuanyuan Dong, Weikang Kong
Abstract: Erasure-coded archival converts the redundancy mechanism for low-access-frequency data from replication to erasure coding, balancing access performance against storage efficiency. A variety of pipelined schemes have been designed to speed up the archival operation; however, they neglect three factors that restrict or even negate the performance gains: heterogeneous networks, under-utilization of replica resources, and tight coupling with underlying platforms. In this paper, we propose Archpipe, a fast and flexible pipelined erasure-coded archival scheme. It exhibits three distinct features: 1) heterogeneous-network awareness: for a single-pipelined construction, links with sufficient bandwidth are given high scheduling priority to avoid network congestion, while locality is considered to reduce network transmissions; 2) parallel encoding: unused replica resources are exploited to adaptively construct multiple pipelines for each stripe based on the single-pipelined algorithm, enabling parity blocks to be encoded in parallel; 3) loose coupling: it does not rely on specific block placement policies or stripe construction algorithms. Experimental results indicate that Archpipe can be seamlessly integrated with common distributed storage systems, and that it improves erasure-coded archival performance by 3.6∼4.7× and 1.3∼2.6× in on-disk and in-memory scenarios, respectively.
Citations: 0
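The encode step that Archpipe's pipelines parallelize can be illustrated with the simplest possible erasure code: single XOR parity over k data blocks. Production archival systems such as the ones Archpipe targets use Reed-Solomon codes with multiple parity blocks; this is only a minimal sketch of the encode/recover idea.

```python
def encode_parity(blocks):
    """XOR single-parity encode: k data blocks yield one parity block.
    (Real systems use Reed-Solomon with m > 1 parity blocks.)"""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from survivors plus parity."""
    missing = bytearray(parity)
    for block in surviving_blocks:
        for i, b in enumerate(block):
            missing[i] ^= b
    return bytes(missing)

data = [b"abcd", b"efgh", b"ijkl"]
p = encode_parity(data)
# Lose block 1, rebuild it from blocks 0 and 2 plus parity
assert recover([data[0], data[2]], p) == data[1]
```

In a pipelined archival scheme, each node XORs its local block into a partial parity and forwards it downstream, so the parity accumulates along the pipeline instead of being computed at one node.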
CSMV: A Highly Scalable Multi-Versioned Software Transactional Memory for GPUs
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00057
D. Nunes, Daniel Castro, P. Romano
Abstract: GPUs have traditionally focused on streaming applications with regular parallelism. Over recent years, though, GPUs have also been used successfully to accelerate irregular applications in a number of domains by means of fine-grained synchronization schemes. Unfortunately, fine-grained synchronization strategies are notoriously complex and error-prone. This has motivated the search for alternative paradigms aimed at simplifying concurrent programming, among which Transactional Memory (TM) is probably one of the most prominent proposals. This paper introduces CSMV (Client-Server Multi-Versioned), a multi-versioned Software TM (STM) for GPUs that adopts an innovative client-server design. By decoupling the execution of transactions from their commit process, CSMV provides two main benefits: (i) it enables the use of fast on-chip memory to access the global metadata used to synchronize transactions, and (ii) it allows for highly efficient collaborative commit procedures, tailored to take full advantage of the architectural characteristics of GPUs. Via an extensive experimental study, we show that CSMV achieves up to three orders of magnitude speedup with respect to state-of-the-art STMs for GPUs, and that it can accelerate irregular applications running on state-of-the-art STMs for CPUs by up to 20×.
Citations: 1
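The multi-versioning that gives CSMV its name can be sketched as a minimal MVCC store: each key holds a timestamped version list, and a reader pins the snapshot current at its start, so reads never block on later writers. This is only the generic multi-version idea; CSMV's actual contribution (the client-server commit protocol tuned to GPU hardware) is not modeled here.

```python
class MVStore:
    """Tiny multi-versioned key-value store sketch."""
    def __init__(self):
        self.clock = 0
        self.versions = {}   # key -> [(commit_ts, value)] in commit order

    def begin(self):
        return self.clock    # snapshot timestamp for a new transaction

    def read(self, key, snapshot_ts):
        # Newest version committed at or before the snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def commit(self, writes):
        self.clock += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

s = MVStore()
s.commit({"x": 1})
snap = s.begin()          # this snapshot sees x == 1
s.commit({"x": 2})        # a later writer does not disturb the snapshot
assert s.read("x", snap) == 1
assert s.read("x", s.begin()) == 2
```

Keeping old versions is what lets read-only transactions run without synchronization, which is particularly valuable on GPUs where fine-grained locking is expensive.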
Mixed Precision s-step Conjugate Gradient with Residual Replacement on GPUs
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00091
I. Yamazaki, E. Carson, Brian Kelley
Abstract: The s-step Conjugate Gradient (CG) algorithm has the potential to reduce the communication cost of standard CG by a factor of s. However, though mathematically equivalent, s-step CG may be numerically less stable than standard CG in finite precision, exhibiting slower convergence and decreased attainable accuracy, which limits its use in practice. To improve the numerical behavior of s-step CG and overcome this limitation, we incorporate two techniques: first, we improve convergence through the use of higher precision at critical parts of the s-step iteration; second, we integrate a residual replacement strategy into the resulting mixed precision s-step CG to improve attainable accuracy. Our experimental results on the Summit supercomputer demonstrate that when the higher precision is implemented in hardware, these techniques have virtually no overhead on iteration time while improving both the convergence rate and the attainable accuracy of s-step CG. Even when the higher precision is implemented in software, these techniques may still reduce time-to-solution (speedups of up to 1.8× in our experiments), especially when s-step CG suffers from numerical instability with a small step size and the latency cost becomes a significant part of its iteration time.
Citations: 1
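To see where the communication cost comes from, here is standard CG in NumPy: each iteration performs two dot products, which on a distributed machine become global reductions. s-step CG restructures the recurrences so that s iterations share one communication phase, which is exactly the reorganization that trades stability for latency as the abstract describes. This sketch is the textbook algorithm, not the paper's mixed-precision variant.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Standard conjugate gradient for SPD systems A x = b.
    The two dot products per iteration (p @ Ap and r @ r) are the
    global reductions that s-step CG amortizes across s iterations."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
print(np.allclose(A @ x, b))  # True
```

The residual replacement strategy mentioned in the abstract periodically recomputes r = b - A x directly (instead of updating it recursively), resetting the accumulated rounding drift between the true and recursively updated residuals.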
Top-Down Performance Profiling on NVIDIA's GPUs
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00026
Álvaro Sáiz, P. Prieto, Pablo Abad Fidalgo, J. Gregorio, Valentin Puente
Abstract: The rise of data-intensive algorithms, such as machine learning, has driven a strong diversification of Graphics Processing Units (GPUs) into fields with intensive data-level parallelism. This trend, known as general-purpose computing on GPUs (GP-GPU), makes the execution process on a GPU (seemingly simple in its architecture) far from trivial when targeting performance across many dissimilar applications. A proof of this is the existence of many profiling tools that help programmers understand how to maximize hardware utilization. In contrast, this paper proposes a profiling tool focused on microarchitecture analysis under large sets of dissimilar applications. The tool therefore has a double objective: on the one hand, to check the suitability of a GPU for diverse sets of application kernels; on the other, to identify possible bottlenecks in a given GPU microarchitecture, facilitating the improvement of subsequent designs. To this end, taking the Top-Down methodology proposed by Intel for its CPUs as inspiration, we define a hierarchical organization of the GPU's execution pipeline. The proposal uses the available hardware performance counters to identify how each component contributes to performance losses. We demonstrate the feasibility of the proposed methodology by analyzing how different modern NVIDIA architectures behave when running relevant benchmarks, assessing in which microarchitecture components performance losses are most significant.
Citations: 4
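The core of any Top-Down-style analysis is attributing issue slots to "useful work" versus stall categories, each as a fraction of total slots, then descending into the dominant category. The sketch below shows only that attribution step; the category names are illustrative placeholders, not NVIDIA's counters, and the paper's actual hierarchy over GPU hardware counters is considerably richer.

```python
def top_down(counters):
    """Top-Down-style slot attribution: express each category as a share
    of total issue slots, sorted so the dominant bottleneck comes first.
    Category names here are illustrative, not real NVIDIA counter names."""
    total = counters["total_slots"]
    shares = {k: v / total for k, v in counters.items() if k != "total_slots"}
    return dict(sorted(shares.items(), key=lambda kv: -kv[1]))

profile = top_down({
    "total_slots": 1000,
    "retiring": 450,        # slots doing useful work
    "memory_stall": 350,
    "execution_stall": 150,
    "frontend_stall": 50,
})
print(next(iter(profile)))  # retiring
```

In a real Top-Down run, the analyst would drill into whichever stall category dominates (here memory stalls, at 35% of slots) rather than tuning blindly.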
Compiler-Directed Incremental Checkpointing for Low Latency GPU Preemption
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00078
Zhuoran Ji, Cho-Li Wang
Abstract: GPUs are widely used in data centers to accelerate data-parallel applications. The multi-user, multitasking environment provides a strong incentive for preemptive GPU multitasking, especially for latency-sensitive jobs. Due to the large contexts of GPU kernels, preemptive GPU context switching is costly, and many novel GPU preemption techniques have been proposed. Among them, checkpoint-based GPU preemption enables low-latency preemption but incurs a high runtime overhead. Prior studies propose excluding dead registers from the checkpoint file to reduce this overhead. This works well for CPUs, but for GPU kernels it is not rare that a live register goes un-updated between two checkpoints. This paper presents TripleC, a compiler-directed incremental checkpointing technique specially designed for GPU preemption. Using data-flow analysis, it further excludes from the checkpoint file those registers that have not been overwritten since the last time they were spilled. TripleC's checkpoint placement algorithm properly estimates a checkpoint's cost under incremental checkpointing, and it considers the interaction among checkpoints so that the overall cost is minimized. Moreover, TripleC relaxes the conventional constraint that the whole register context must be spilled before passing a checkpoint. Because of diverse control flow, placing a register-spilling instruction at different points incurs different costs; TripleC minimizes this cost with a two-phase algorithm that schedules spilling instructions at compilation time. Evaluations show that TripleC reduces runtime overhead by 12.9% on average compared with the state-of-the-art non-incremental checkpointing approach.
Citations: 1
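The incremental idea at the heart of TripleC (spill only what changed since the last spill) can be sketched with a runtime dirty set. Note the key difference: TripleC determines the "dirty" registers statically, at compile time, via data-flow analysis; the runtime bookkeeping below is purely for illustration.

```python
class IncrementalCheckpointer:
    """Sketch: only registers overwritten since the last checkpoint are
    spilled. TripleC computes this set with compiler data-flow analysis
    rather than tracking it at runtime as done here."""
    def __init__(self):
        self.registers = {}        # live register file
        self.dirty = set()         # overwritten since last checkpoint
        self.checkpoint_file = {}  # persisted state

    def write_reg(self, name, value):
        self.registers[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        # Only dirty registers cost spill bandwidth
        spilled = {r: self.registers[r] for r in self.dirty}
        self.checkpoint_file.update(spilled)
        self.dirty.clear()
        return spilled

cp = IncrementalCheckpointer()
cp.write_reg("r0", 1)
cp.write_reg("r1", 2)
assert len(cp.checkpoint()) == 2   # first checkpoint spills both
cp.write_reg("r0", 3)
assert len(cp.checkpoint()) == 1   # second spills only the rewritten r0
```

Preemption latency stays low because the checkpoint file always holds a complete, recent context, while steady-state overhead shrinks to the registers that actually changed.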
Preprocessing Pipeline Optimization for Scientific Deep Learning Workloads
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00112
K. Ibrahim, L. Oliker
Abstract: Newly developed machine learning technology promises to profoundly impact high-performance computing, with the potential to significantly accelerate scientific discovery. However, scientific machine learning performance is often constrained by data movement overheads, particularly on existing and emerging hardware-accelerated systems. In this work, we focus on optimizing data movement across storage and memory systems by developing domain-specific data encoders/decoders. These plugins have the dual benefit of significantly reducing communication while enabling efficient decoding on the accelerated hardware. We present detailed performance analyses of two important scientific learning workloads from cosmology and climate analytics, CosmoFlow and DeepCAM, on the GPU-enabled Summit and Cori supercomputers. Results demonstrate that our optimizations can improve overall performance by up to 10× compared with the default baseline, while preserving convergence behavior. Overall, this methodology can be applied to various machine learning domains and emerging AI technologies.
Citations: 1
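The encoder/decoder pattern the abstract describes can be sketched generically: compress a training sample before it crosses the storage/memory boundary and decode it near the accelerator. The paper's encoders are tailored per domain (CosmoFlow, DeepCAM); the generic zlib/pickle pair below is only a stand-in for that idea.

```python
import pickle
import zlib

import numpy as np

def encode_sample(array):
    """Generic encoder stand-in: serialize and compress a sample so that
    fewer bytes move through storage and interconnect. Domain-specific
    encoders (as in the paper) exploit structure zlib cannot."""
    return zlib.compress(pickle.dumps(array))

def decode_sample(blob):
    """Decode step, intended to run cheaply near the accelerator."""
    return pickle.loads(zlib.decompress(blob))

# A smooth scientific field is highly compressible
sample = np.zeros((64, 64), dtype=np.float32)
blob = encode_sample(sample)
assert decode_sample(blob).shape == (64, 64)
assert len(blob) < sample.nbytes   # far fewer bytes moved
```

The win comes from paying a small decode cost on fast hardware in exchange for a large reduction in bytes read from storage, which is the bottleneck the paper targets.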
DENOVA: Deduplication Extended NOVA File System
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00134
Hyungjoon Kwon, Yonghyeon Cho, Awais Khan, Yeohyeon Park, Youngjae Kim
Abstract: This paper shows, mathematically and experimentally, that inline deduplication is unsuitable in performance terms for file systems on ultra-low-latency Intel Optane DC PM devices, and proposes DeNova, an offline deduplication scheme specially designed for log-structured NVM file systems such as NOVA. DeNova offers high-performance, low-latency I/O processing and executes deduplication in the background without interfering with foreground I/Os. It employs DRAM-free persistent deduplication metadata that favors the CPU cache line, and it ensures consistency across any system failure. We implement DeNova in the NOVA file system. Evaluation confirms a negligible performance drop relative to baseline NOVA of less than 1%, while achieving high storage space savings. Extensive experiments show DeNova is failure-consistent in all failure scenarios.
Citations: 4
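Content-addressed block deduplication, the mechanism DeNova runs offline, can be sketched in a few lines: blocks are fingerprinted by hash and stored once, while files keep lists of fingerprints. DeNova's persistent, NVM-resident, cache-line-aware metadata layout is far more involved than the in-memory dictionaries used here.

```python
import hashlib

class DedupStore:
    """Minimal content-addressed dedup sketch (in-memory, not NVM)."""
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # fingerprint -> block data (stored once)
        self.files = {}    # file name -> list of fingerprints

    def write(self, name, data):
        fps = []
        for off in range(0, len(data), self.block_size):
            chunk = data[off:off + self.block_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(fp, chunk)  # store only new content
            fps.append(fp)
        self.files[name] = fps

    def read(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])

store = DedupStore(block_size=4)
store.write("a", b"AAAABBBB")
store.write("b", b"BBBBAAAA")          # same two blocks, reordered
assert store.read("b") == b"BBBBAAAA"
assert len(store.blocks) == 2          # only two unique blocks kept
```

Running this scan offline, in the background, is what lets DeNova avoid adding hashing latency to the foreground write path, which is the paper's central argument against inline dedup on Optane-class devices.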
Exploring Efficient Microservice Level Parallelism
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00030
Xinkai Wang, Chao Li, Lu Zhang, Xiaofeng Hou, Quan Chen, Minyi Guo
Abstract: The microservice architecture has recently become a driving trend in the cloud by disaggregating a monolithic application into many scenario-oriented service blocks (microservices). This decomposition results in a highly dynamic execution scenario in which various chained microservices contend for computing resources in different ways. While parallelism has been exploited at both the instruction/thread level and the task/request level, very limited work has been done at the granularity of a microservice. Current parallel processing solutions are sub-optimal, as they neither capture the unique characteristics of microservices nor consider the uncertainty arising in the microservice environment. In this work we introduce microservice-level parallelism (MLP), a technique that aims to precisely coalesce and align parallel microservice chains for better system performance and resource utilization. We identify the major issues that prevent servers from effectively exploiting MLP, and we define metrics that can guide MLP optimization. We propose v-MLP, a volatility-aware MLP that adapts to a highly heterogeneous and dynamic microservice environment. We show that v-MLP can reduce tail latency by up to 50% and improve resource utilization by up to 15% under various scenarios.
Citations: 4
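The baseline opportunity MLP exploits is that independent microservice chains within one request need not run back-to-back. A minimal sketch, with hypothetical stage names: each chain is a sequence of stages, and independent chains are dispatched concurrently. v-MLP's actual contribution (deciding how to coalesce and align chains under load volatility) goes well beyond this fixed thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def run_chain(chain, payload):
    """Run one microservice chain: each stage consumes the previous output."""
    for stage in chain:
        payload = stage(payload)
    return payload

def run_parallel(chains, payload, workers=4):
    """Dispatch independent chains of one request concurrently instead of
    sequentially. Stage functions stand in for RPC calls to microservices."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_chain, chain, payload) for chain in chains]
        return [f.result() for f in futures]

# Hypothetical stages for illustration
auth = lambda x: x + ["auth"]
cart = lambda x: x + ["cart"]
ads = lambda x: x + ["ads"]

results = run_parallel([[auth, cart], [ads]], [])
assert results == [["auth", "cart"], ["ads"]]
```

Tail latency improves because the request completes when the longest chain finishes, rather than when the sum of all chains finishes.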
PARSEC: PARallel Subgraph Enumeration in CUDA
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00025
Vibhor Dodeja, M. Almasri, R. Nagi, Jinjun Xiong, Wen-mei W. Hwu
Abstract: Subgraph enumeration is an important problem in graph analytics with numerous applications. The problem is provably NP-complete and requires sophisticated heuristics and highly efficient implementations to be feasible at realistic problem sizes. Parallel solutions have shown considerable promise on CPUs and in distributed environments. Recently, GPU-based parallel solutions have also been proposed to take advantage of the massive execution resources of modern GPUs. Subgraph enumeration involves traversing a search tree for each vertex of the data graph to find matches of a query graph. Most GPU-based solutions traverse the tree breadth-first, which exploits parallelism at the cost of high memory requirements and presents a formidable challenge for processing large graphs with high-degree vertices, since GPU memory capacity is significantly lower than that of CPUs. In this work, we propose a novel GPU solution based on a hybrid BFS/DFS approach, in which the top level(s) of the search trees are traversed in a fully parallel, breadth-first manner while each subtree is traversed in a more space-efficient, depth-first manner. Depth-first traversal of subtrees requires less memory but is less amenable to parallel execution; to overcome this, we exploit fine-grained parallelism in each step of the depth-first traversal. We further identify and implement various optimizations to efficiently utilize the memory and compute resources of GPUs. Compared with state-of-the-art GPU and CPU implementations, we achieve geometric-mean speedups of 9.47× (up to 92.01×) and 2.37× (up to 12.70×), respectively. We also show that the proposed approach can efficiently process graphs that previously could not be handled by state-of-the-art GPU solutions due to their excessive memory requirements.
Citations: 4
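The BFS-versus-DFS trade-off the abstract describes is easy to see on the simplest query, a triangle: a depth-first matcher extends one partial match (an edge) at a time through common neighbors, never materializing the full frontier of partial matches that a breadth-first matcher would hold in memory. This sequential sketch shows only the traversal shape, not PARSEC's CUDA parallelization.

```python
def count_triangles(adj):
    """Depth-first triangle enumeration: for each edge (u, v) with u < v,
    extend by common neighbors w > v. Only one partial match is alive at
    a time, which is the memory frugality DFS trades for parallelism."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue                  # canonical order: count each once
            for w in adj[u] & adj[v]:     # common-neighbor extension
                if w > v:
                    count += 1
    return count

# Complete graph K4 contains C(4, 3) = 4 triangles
adj = {i: {j for j in range(4) if j != i} for i in range(4)}
print(count_triangles(adj))  # 4
```

PARSEC's hybrid keeps the top of this search tree breadth-first (plenty of independent roots to fill the GPU) and runs each subtree depth-first, extracting fine-grained parallelism from steps such as the set intersection above.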