{"title":"The Support of MLIR HLS Adaptor for LLVM IR","authors":"Geng-Ming Liang, Chuan-Yue Yuan, Meng-Shiun Yuan, Tai-Liang Chen, Kuan-Hsun Chen, Jenq-Kuen Lee","doi":"10.1145/3547276.3548515","DOIUrl":"https://doi.org/10.1145/3547276.3548515","url":null,"abstract":"Since the emergence of MLIR, High-level Synthesis (HLS) tools have started to be designed with multi-level abstractions. Unlike traditional HLS tools that are based on a single abstraction (e.g., LLVM), optimizations at different levels of abstraction can benefit from cross-layer optimization to achieve better results. Although current HLS tools with MLIR can generate HLS C/C++ for synthesis, we believe that a direct IR transformation from MLIR to LLVM retains more expression details. In this paper, we propose an adaptor for LLVM IR, which optimizes the IR generated from MLIR into HLS-readable IR. Without the gap of unsupported syntax between different versions, developers can focus on their specialization. Our preliminary results show that the MLIR flow via our adaptor generates performance results comparable to those of MLIR HLS tools that generate HLS C++ code. The experiments are performed with Xilinx Vitis and HLS tools.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128581841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structured Concurrency: A Review","authors":"Yi-An Chen, Yi-Ping You","doi":"10.1145/3547276.3548519","DOIUrl":"https://doi.org/10.1145/3547276.3548519","url":null,"abstract":"Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs to ensure a better user experience. With this demand, the ability to manage these concurrent tasks efficiently and easily, especially over their lifetimes, is becoming a new challenge. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support such a paradigm for better concurrency management. It is worth noting that structured concurrency has been consistently implemented on top of coroutines across all these languages and libraries. However, no documents or studies in the literature indicate why and how coroutines are relevant to structured concurrency. In contrast, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of “structure” extends from ordinary programming to concurrent programming. Nevertheless, such a viewpoint does not explain why the concept of structured concurrency emerged more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming started in the 1960s. In this paper, we introduce a new theory to complement the origin of structured concurrency from historical and technical perspectives—it is the foundation established by coroutines that gives birth to structured concurrency.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127440949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps","authors":"Eishi Arima, Minjoon Kang, Issa Saba, J. Weidendorfer, C. Trinitis, Martin Schulz","doi":"10.1145/3547276.3548630","DOIUrl":"https://doi.org/10.1145/3547276.3548630","url":null,"abstract":"CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As a single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become a first-class design constraint for HPC systems, such co-scheduling techniques should be well-tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVIDIA’s MIG (Multi-Instance GPU) partitions a single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, and power capping based on our scalability/interference modeling, while taking a variety of aspects into account, such as compute/memory intensity and utilization of heterogeneous computational resources (e.g., Tensor Cores). Experimental results indicate that our approach successfully selects a near-optimal combination across multiple different workloads.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130408063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing Hierarchical Multi-HCA Aware Allgather in MPI","authors":"Tu Tran, Benjamin Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, D. Panda","doi":"10.1145/3547276.3548524","DOIUrl":"https://doi.org/10.1145/3547276.3548524","url":null,"abstract":"To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a “multi-rail” network. The second- and third-placed systems of the Top500 use two adapters per node; recently, the ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. With such an abundance of networking resources, utilizing all of them is a non-trivial task. The Message Passing Interface (MPI) is a dominant model for high-performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we take up this task and propose hierarchical, multi-HCA-aware Allgather designs; Allgather is a communication-intensive collective widely used in applications like matrix multiplication and in other collectives. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, our new schemes achieve performance improvements for both single-node and multi-node communication. We see inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, for 1024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X. At the application level, the enhanced Allgather shows 1.98x and 1.42x improvement in a matrix-vector multiplication kernel when compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2-X.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134311827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution Flow Aware Profiling for ROS-based Autonomous Vehicle Software","authors":"Shao-Hua Wang, Chia-Heng Tu, C. Huang, J. Juang","doi":"10.1145/3547276.3548516","DOIUrl":"https://doi.org/10.1145/3547276.3548516","url":null,"abstract":"The complexity of Robot Operating System (ROS) based autonomous software grows as autonomous vehicles become more intelligent. It is a big challenge for system designers to rapidly understand the runtime behavior and performance of such sophisticated software, because conventional tools are insufficient for characterizing the high-level interactions of the modules within the software. In this paper, a new graphical representation, the execution flow graph, is devised to represent the execution sequences and related performance statistics of ROS modules. The execution-flow-aware profiling is applied to the autonomous software Autoware and Navigation Stack, with encouraging results.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"664 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116099578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Software/Hardware Co-design Local Irregular Sparsity Method for Accelerating CNNs on FPGA","authors":"Jiangwei Shang, Zhan Zhang, Chuanyou Li, Kun Zhang, Lei Qian, Hongwei Liu","doi":"10.1145/3547276.3548521","DOIUrl":"https://doi.org/10.1145/3547276.3548521","url":null,"abstract":"Convolutional neural networks (CNNs) have been widely used in different areas. The success of CNNs comes with a huge amount of parameters and computations, and nowadays CNNs keep moving toward larger structures. Although larger structures often bring about better inference accuracy, the increasing size also slows the inference speed down. Recently, various parameter sparsity methods have been proposed to accelerate CNNs by reducing the number of parameters and computations. Existing sparsity methods can be classified into two categories: unstructured and structured. Unstructured sparsity methods easily cause irregularity and thus yield suboptimal speedups. On the other hand, structured sparsity methods keep regularity by pruning the parameters following a certain pattern, but result in low sparsity. In this paper, we propose a software/hardware co-design approach to bring local irregular sparsity into CNNs. Benefiting from the local irregularity, we design a row-wise computing engine, RConv Engine, to achieve workload balance and remarkable speedup. The experimental results show that our software/hardware co-design method achieves a 10.9x speedup over state-of-the-art methods with negligible accuracy loss.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115241898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Stage Pre-processing for License Recognition","authors":"J. Zhang, Cheng-Tsung Chan, Minmin Sun","doi":"10.1145/3547276.3548441","DOIUrl":"https://doi.org/10.1145/3547276.3548441","url":null,"abstract":"Various financial insurance and investment application websites require customers to upload identity documents, such as vehicle licenses, to verify their identities. Manual verification of these documents is costly. Hence, there is a clear demand for automatic document recognition. This study proposes a two-stage method to pre-process a vehicle license for better text recognition. In the first stage, the distortion that often appears in photographed documents is repaired. In the second stage, each data field is carefully located. The captured fields are then processed by commercial text recognition software. Due to the sensitivity of vehicle licenses, it is difficult to collect enough data for model training. Consequently, artificial vehicle licenses are synthesized for model training to mitigate overfitting. In addition, an encoder is applied to reduce the background noise, remove the border crossing over text, and make the blurred text clearer before text recognition. Evaluated on a real dataset, the proposed method achieves accuracy close to 90%.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116591890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for low communication approaches for large scale 3D convolution","authors":"Anuva Kulkarni, Jelena Kovacevic, F. Franchetti","doi":"10.1145/3547276.3548626","DOIUrl":"https://doi.org/10.1145/3547276.3548626","url":null,"abstract":"Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to/from memory have not increased proportionally to computational capacity (in terms of FLOPs), 3D FFTs become bounded by communication and are difficult to scale, especially on modern heterogeneous computing platforms consisting of accelerators like GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution that avoids multiple all-to-all exchanges and also optimizes the necessary communication. We provide proof-of-concept results under the assumptions of a use case, the MASSIF Hooke’s law simulation convolution kernel. Our method localizes computation by exploiting properties of the data and approximates the convolution result via data compression, resulting in increased scalability of 3D convolution. Our preliminary results show 8x better scalability than traditional methods on the same compute resources, without adversely affecting result accuracy. Our method can be adapted for first-principle scientific simulations and leverages cross-disciplinary knowledge of the application, the data, and computing to perform large-scale convolution while avoiding communication bottlenecks. To make our approach widely usable and adaptable for emerging challenges, we discuss the use of FFTX, a novel framework which can be used for platform-agnostic specification and optimization of algorithmic approaches similar to ours.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132073449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Pipeline Pattern Detection Technique in Polly","authors":"Delaram Talaashrafi, J. Doerfert, M. M. Maza","doi":"10.1145/3547276.3548445","DOIUrl":"https://doi.org/10.1145/3547276.3548445","url":null,"abstract":"The polyhedral model has repeatedly shown how it facilitates various loop transformations, including loop parallelization, loop tiling, and software pipelining. However, parallelism is almost exclusively exploited on a per-loop basis without much work on detecting cross-loop parallelization opportunities. While many problems can be scheduled such that loop dimensions are dependence-free, the resulting loop parallelism does not necessarily maximize concurrent execution, especially not for unbalanced problems. In this work, we introduce a polyhedral-model-based analysis and scheduling algorithm that exposes and utilizes cross-loop parallelization through tasking. This work exploits pipeline patterns between iterations in different loop nests, and it is well suited to handle imbalanced iterations. Our LLVM/Polly-based prototype performs schedule modifications and code generation targeting a minimal, language-agnostic tasking layer. We present results using an implementation of this API with the OpenMP task construct. For different computation patterns, we achieved speed-ups of up to 3.5x on a quad-core processor, while LLVM/Polly alone fails to exploit the parallelism.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131955309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency Recovery in Power Grids using High-Performance Computing","authors":"Vishwas Rao, A. Subramanyam, Michel Schanen, Youngdae Kim, Ignas Šatkauskas, M. Anitescu","doi":"10.1145/3547276.3548632","DOIUrl":"https://doi.org/10.1145/3547276.3548632","url":null,"abstract":"Maintaining electric power system stability is paramount, especially in extreme contingencies involving unexpected outages of multiple generators or transmission lines that are typical during severe weather events. Such outages often lead to large supply-demand mismatches followed by subsequent system frequency deviations from their nominal value. The extent of frequency deviations is an important metric of system resilience, and its timely mitigation is a central goal of power system operation and control. This paper develops a novel nonlinear model predictive control (NMPC) method to minimize frequency deviations when the grid is affected by an unforeseen loss of multiple components. Our method is based on a novel multi-period alternating current optimal power flow (ACOPF) formulation that accurately models both nonlinear electric power flow physics and the primary and secondary frequency response of generator control mechanisms. We develop a distributed parallel Julia package for solving the large-scale nonlinear optimization problems that result from our NMPC method and thereby address realistic test instances on existing high-performance computing architectures. Our method demonstrates superior performance in terms of frequency recovery over existing industry practices, where generator levels are set based on the solution of single-period classical ACOPF models.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127699588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}