Workshop Proceedings of the 51st International Conference on Parallel Processing: Latest Publications

The Support of MLIR HLS Adaptor for LLVM IR
Geng-Ming Liang, Chuan-Yue Yuan, Meng-Shiun Yuan, Tai-Liang Chen, Kuan-Hsun Chen, Jenq-Kuen Lee
DOI: 10.1145/3547276.3548515 | Published: 2022-08-29
Abstract: Since the emergence of MLIR, High-Level Synthesis (HLS) tools have begun to be designed around multiple levels of abstraction. Unlike traditional HLS tools built on a single abstraction (e.g., LLVM), tools that optimize at several abstraction levels can exploit cross-layer optimizations for better results. Although current MLIR-based HLS tools can generate HLS C/C++ for synthesis, we believe a direct IR transformation from MLIR to LLVM preserves more expression detail. In this paper, we propose an adaptor for LLVM IR that rewrites the IR generated from MLIR into HLS-readable IR. By closing the gap of unsupported syntax between different LLVM versions, the adaptor lets developers focus on their own specialization. Our preliminary results show that the MLIR flow through our adaptor achieves performance comparable to MLIR HLS tools that generate HLS C++ code. The experiments are performed with Xilinx Vitis and HLS tools.
Citations: 1
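A minimal sketch of the kind of textual LLVM IR rewriting such an adaptor might perform, assuming (purely for illustration) that the downstream HLS frontend is built on an older LLVM that rejects the `noundef` parameter attribute and opaque-pointer intrinsic names; the specific rewrites, the function name `adapt_llvm_ir`, and the sample IR line are assumptions, not the paper's actual rules.

```python
import re

def adapt_llvm_ir(ir_text: str) -> str:
    """Rewrite LLVM IR emitted from MLIR so an older HLS frontend can parse it.

    The two rewrites below are illustrative assumptions, not the adaptor's rules.
    """
    # Drop a parameter attribute that older LLVM parsers reject.
    ir_text = re.sub(r"\bnoundef\s+", "", ir_text)
    # Map an opaque-pointer intrinsic name back to the typed-pointer mangling.
    ir_text = ir_text.replace("llvm.memcpy.p0.p0.i64", "llvm.memcpy.p0i8.p0i8.i64")
    return ir_text

example = "call void @llvm.memcpy.p0.p0.i64(ptr noundef %dst, ptr noundef %src, i64 64, i1 false)"
print(adapt_llvm_ir(example))
```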
Structured Concurrency: A Review
Yi-An Chen, Yi-Ping You
DOI: 10.1145/3547276.3548519 | Published: 2022-08-29
Abstract: Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs and ensure a better user experience. With this demand, managing these concurrent tasks, and especially their lifetimes, efficiently and easily is becoming a new challenge. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support this paradigm for better concurrency management. It is worth noting that structured concurrency has consistently been implemented on top of coroutines across all of these languages and libraries. However, no documents or studies in the literature explain why and how coroutines are relevant to structured concurrency. Instead, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of "structure" extends from ordinary programming to concurrent programming. Nevertheless, this viewpoint does not explain why structured concurrency emerged more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming dates back to the 1960s. In this paper, we introduce a new theory that complements the origin of structured concurrency from historical and technical perspectives: it is the foundation established by coroutines that gives birth to structured concurrency.
Citations: 0
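As one concrete illustration of the coroutine foundation the review above discusses, the sketch below uses Python 3.11's asyncio.TaskGroup, which is not one of the languages or libraries named in the abstract: coroutines are spawned inside a scope that bounds their lifetimes, which is the essence of structured concurrency. The sensor names and the fetch coroutine are made up for the example.

```python
import asyncio

async def fetch(sensor: str) -> str:
    # Stand-in for a concurrent task tied to one sensor input.
    await asyncio.sleep(0.1)
    return f"{sensor}: ok"

async def main() -> None:
    # The TaskGroup scope bounds the lifetime of every child task: the block
    # exits only after all tasks finish, and a failure in one cancels its siblings.
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(fetch(s)) for s in ("camera", "lidar", "imu")]
    print([t.result() for t in tasks])

asyncio.run(main())
```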
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps
Eishi Arima, Minjoon Kang, Issa Saba, J. Weidendorfer, C. Trinitis, Martin Schulz
DOI: 10.1145/3547276.3548630 | Published: 2022-08-29
Abstract: CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become the first-class design constraint for HPC systems, such co-scheduling techniques should be well-tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVidia's MIG (Multi-Instance GPU) partitions one single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, as well as power capping based on our scalability/interference modeling while taking a variety of aspects into account, such as compute/memory intensity and utilization in heterogeneous computational resources (e.g., Tensor Cores). The experimental result indicates that our approach is successful in selecting a near optimal combination across multiple different workloads.
Citations: 3
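A toy sketch of the kind of combined search the methodology above performs: enumerate partition layouts, job-to-instance assignments, and per-instance power caps under a node-level budget, and pick the combination a performance model predicts to be best. The partition options, cap values, job parameters, and the throughput model itself are illustrative assumptions, not the paper's scalability/interference model.

```python
from itertools import permutations, product

NODE_POWER_CAP = 400                                    # assumed GPU power budget in watts
PARTITIONS = {"4+3 GPCs": (4, 3), "5+2 GPCs": (5, 2)}   # MIG-like instance sizes
INSTANCE_CAPS = [150, 200, 250]                         # candidate per-instance caps (watts)
JOBS = {"dgemm": (1.00, 0.9), "stencil": (0.70, 0.4)}   # (base rate, power sensitivity)

def throughput(job, gpcs, cap):
    # Illustrative model: scales with the GPC share and sub-linearly with power.
    base, sensitivity = JOBS[job]
    return base * (gpcs / 7.0) * (cap / 250.0) ** sensitivity

best = max(
    (sum(throughput(j, g, c) for j, g, c in zip(jobs, sizes, caps)), layout, jobs, caps)
    for layout, sizes in PARTITIONS.items()
    for jobs in permutations(JOBS, len(sizes))
    for caps in product(INSTANCE_CAPS, repeat=len(sizes))
    if sum(caps) <= NODE_POWER_CAP
)
print("predicted throughput %.3f: layout %s, jobs %s, caps %s" % best)
```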
Designing Hierarchical Multi-HCA Aware Allgather in MPI
Tu Tran, Benjamin Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, D. Panda
DOI: 10.1145/3547276.3548524 | Published: 2022-08-29
Abstract: To accelerate communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a "multi-rail" network. The second- and third-placed systems of the Top500 use two adapters per node; more recently, the ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. With such an abundance of networking resources, utilizing all of them is a non-trivial task. The Message Passing Interface (MPI) is the dominant programming model for high-performance computing clusters, yet not all MPI collectives utilize all available resources, and this becomes more apparent as bandwidth and adapter counts grow. In this work, we take up this task and propose hierarchical, multi-HCA aware Allgather designs; Allgather is a communication-intensive collective widely used in applications such as matrix multiplication and as a building block of other collectives. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, the new schemes improve both single-node and multi-node communication, with inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, at 1,024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X. At the application level, the enhanced Allgather shows 1.98x and 1.42x improvement in a matrix-vector multiplication kernel compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2-X.
Citations: 1
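A minimal sketch of the hierarchical structure behind such designs, written with mpi4py: ranks on a node gather to a leader, leaders exchange blocks across the network, and the result is broadcast back within each node. The multi-HCA striping and the inter/intra-node overlap the paper adds are not shown, and using rank IDs as the payload is only for illustration.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Group the ranks that share a node, then form a communicator of node leaders.
node = comm.Split_type(MPI.COMM_TYPE_SHARED)
leaders = comm.Split(0 if node.rank == 0 else MPI.UNDEFINED, comm.rank)

# Stage 1: gather every rank's contribution (here, just its rank ID) on the node leader.
node_block = node.gather(comm.rank, root=0)

# Stage 2: leaders exchange per-node blocks across the network; a multi-HCA
# design would additionally stripe this traffic over all adapters.
gathered = None
if node.rank == 0:
    gathered = [x for block in leaders.allgather(node_block) for x in block]

# Stage 3: broadcast the assembled result back to every rank on the node.
result = node.bcast(gathered, root=0)
print(comm.rank, result)
```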
Execution Flow Aware Profiling for ROS-based Autonomous Vehicle Software
Shao-Hua Wang, Chia-Heng Tu, C. Huang, J. Juang
DOI: 10.1145/3547276.3548516 | Published: 2022-08-29
Abstract: The complexity of autonomous driving software built on the Robot Operating System (ROS) grows as autonomous vehicles become more intelligent. Rapidly understanding the runtime behavior and performance of such sophisticated software is a major challenge for system designers, because conventional tools cannot characterize the high-level interactions of the modules within the software. In this paper, a new graphical representation, the execution flow graph, is devised to represent the execution sequences and related performance statistics of ROS modules. The execution flow aware profiling is applied to the autonomous software stacks Autoware and Navigation Stack, with encouraging results.
Citations: 0
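The sketch below shows one simple way an execution flow graph could be assembled from a callback trace: consecutive callback activations become edges, each annotated with observed latencies. The trace records and callback names are fabricated for the example; the paper's graph is built from ROS-level module interactions, which this stand-in does not model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace: (timestamp in seconds, callback name) for one process.
trace = [
    (0.00, "lidar_driver"), (0.02, "point_cloud_filter"),
    (0.05, "object_detector"), (0.10, "lidar_driver"),
    (0.12, "point_cloud_filter"), (0.16, "object_detector"),
]

# Execution flow graph: each edge (a -> b) collects the latencies observed
# between consecutive activations of a and b.
edges = defaultdict(list)
for (t0, a), (t1, b) in zip(trace, trace[1:]):
    edges[(a, b)].append(t1 - t0)

for (a, b), lats in edges.items():
    print(f"{a} -> {b}: n={len(lats)}, mean latency={mean(lats) * 1e3:.1f} ms")
```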
A Software/Hardware Co-design Local Irregular Sparsity Method for Accelerating CNNs on FPGA
Jiangwei Shang, Zhan Zhang, Chuanyou Li, Kun Zhang, Lei Qian, Hongwei Liu
DOI: 10.1145/3547276.3548521 | Published: 2022-08-29
Abstract: Convolutional neural networks (CNNs) have been widely used in many areas. Their success comes with a huge number of parameters and computations, and CNNs continue to move toward larger structures. Although larger structures often bring better inference accuracy, the increasing size also slows inference down. Recently, various parameter sparsity methods have been proposed to accelerate CNNs by reducing the number of parameters and computations. Existing sparsity methods fall into two categories: unstructured and structured. Unstructured sparsity methods easily cause irregularity and thus achieve suboptimal speedups. Structured sparsity methods, on the other hand, preserve regularity by pruning parameters according to a fixed pattern but result in low sparsity. In this paper, we propose a software/hardware co-design approach that brings local irregular sparsity into CNNs. Benefiting from the local irregularity, we design a row-wise computing engine, the RConv Engine, to achieve workload balance and a remarkable speedup. The experimental results show that our software/hardware co-design method achieves a 10.9x speedup over state-of-the-art methods with negligible accuracy loss.
Citations: 0
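A small numpy sketch of locally irregular but row-balanced sparsity, assuming a magnitude-based criterion: each row keeps its k largest-magnitude weights, so the surviving positions are irregular inside a row while every row carries the same load for a row-wise engine. The pruning criterion and the helper name prune_rowwise are assumptions; the paper's actual pattern and the RConv Engine are not reproduced here.

```python
import numpy as np

def prune_rowwise(weights: np.ndarray, keep_per_row: int) -> np.ndarray:
    """Keep the `keep_per_row` largest-magnitude weights in every row.

    Within a row the surviving positions are irregular, but every row keeps
    the same count, so a row-wise compute engine sees a balanced workload.
    """
    pruned = np.zeros_like(weights)
    idx = np.argpartition(np.abs(weights), -keep_per_row, axis=1)[:, -keep_per_row:]
    np.put_along_axis(pruned, idx, np.take_along_axis(weights, idx, axis=1), axis=1)
    return pruned

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
print(prune_rowwise(w, keep_per_row=2))
```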
Two-Stage Pre-processing for License Recognition
J. Zhang, Cheng-Tsung Chan, Minmin Sun
DOI: 10.1145/3547276.3548441 | Published: 2022-08-29
Abstract: Various financial, insurance, and investment application websites require customers to upload identity documents, such as vehicle licenses, to verify their identities. Manual verification of these documents is costly, so there is a clear demand for automatic document recognition. This study proposes a two-stage method to pre-process a vehicle license for better text recognition. In the first stage, the distortion that often appears in photographed documents is repaired. In the second stage, each data field is carefully located. The captured fields are then processed by commercial text recognition software. Because of the sensitivity of vehicle licenses, it is difficult to collect enough data for model training; consequently, artificial vehicle licenses are synthesized for training to mitigate overfitting. In addition, an encoder is applied before text recognition to reduce background noise, remove borders crossing over text, and sharpen blurred text. On a real dataset, the proposed method achieves accuracy close to 90%.
Citations: 0
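A hedged sketch of the two pre-processing stages using OpenCV, assuming the four document corners have already been detected and the field boxes are known in the rectified layout: stage one warps the photographed license onto an upright rectangle, stage two crops an individual data field. The target size, function names, corner coordinates, and box coordinates are illustrative only, not the paper's models.

```python
import cv2
import numpy as np

def rectify_document(image: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Stage-1-style distortion repair: warp the four detected document corners
    onto an upright rectangle (corner detection itself is assumed done)."""
    w, h = 640, 400  # target size, arbitrary for this sketch
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(np.float32(corners), dst)
    return cv2.warpPerspective(image, M, (w, h))

def crop_field(rectified: np.ndarray, box: tuple) -> np.ndarray:
    """Stage-2-style field location: cut one data field out by a known box."""
    x, y, w, h = box
    return rectified[y:y + h, x:x + w]

img = np.full((480, 640, 3), 255, dtype=np.uint8)  # stand-in photograph
corners = np.float32([[40, 60], [600, 80], [590, 420], [30, 400]])
field = crop_field(rectify_document(img, corners), (20, 20, 200, 40))
print(field.shape)
```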
A framework for low communication approaches for large scale 3D convolution
Anuva Kulkarni, Jelena Kovacevic, F. Franchetti
DOI: 10.1145/3547276.3548626 | Published: 2022-08-29
Abstract: Large-scale 3D convolutions computed with parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to and from memory have not increased in proportion to computational capacity (in FLOPs), 3D FFTs become communication-bound and difficult to scale, especially on modern heterogeneous platforms with accelerators such as GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or its communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution that avoids multiple all-to-all exchanges and also optimizes the communication that remains. We provide proof-of-concept results under the assumptions of one use case, the MASSIF Hooke's law simulation convolution kernel. Our method localizes computation by exploiting properties of the data and approximates the convolution result through data compression, increasing the scalability of 3D convolution. Our preliminary results show 8x better scalability than traditional methods on the same compute resources without adversely affecting result accuracy. The method can be adapted to first-principles scientific simulations and leverages cross-disciplinary knowledge of the application, the data, and the computing platform to perform large-scale convolution while avoiding communication bottlenecks. To make our approach widely usable and adaptable to emerging challenges, we discuss the use of FFTX, a novel framework for platform-agnostic specification and optimization of algorithmic approaches similar to ours.
Citations: 0
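For reference, the sketch below is the plain FFT-based (circular) 3D convolution whose distributed form requires the all-to-all transposes this work avoids; it is a baseline illustration, not the paper's localized, compression-based method. The identity-kernel check at the end is just a quick sanity test with invented sizes.

```python
import numpy as np

def fft_convolve3d(field: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Baseline circular 3D convolution via forward/inverse FFT.

    A distributed version of these transforms is what demands all-to-all
    exchanges on a cluster.
    """
    return np.real(np.fft.ifftn(np.fft.fftn(field) * np.fft.fftn(kernel, field.shape)))

rng = np.random.default_rng(1)
field = rng.standard_normal((32, 32, 32))
kernel = np.zeros((3, 3, 3))
kernel[0, 0, 0] = 1.0  # delta at the origin: convolution should return the field unchanged
print(np.allclose(fft_convolve3d(field, kernel), field))
```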
A Pipeline Pattern Detection Technique in Polly
Delaram Talaashrafi, J. Doerfert, M. M. Maza
DOI: 10.1145/3547276.3548445 | Published: 2022-08-29
Abstract: The polyhedral model has repeatedly shown how it facilitates various loop transformations, including loop parallelization, loop tiling, and software pipelining. However, parallelism is almost exclusively exploited on a per-loop basis, with little work on detecting cross-loop parallelization opportunities. While many problems can be scheduled such that loop dimensions are dependence-free, the resulting loop parallelism does not necessarily maximize concurrent execution, especially for unbalanced problems. In this work, we introduce a polyhedral-model-based analysis and scheduling algorithm that exposes and exploits cross-loop parallelism through tasking. The approach exploits pipeline patterns between iterations of different loop nests and is well suited to handling imbalanced iterations. Our LLVM/Polly-based prototype performs schedule modification and code generation targeting a minimal, language-agnostic tasking layer, and we present results using an implementation of this API with the OpenMP task construct. For different computation patterns, we achieve speed-ups of up to 3.5x on a quad-core processor, whereas LLVM/Polly alone fails to exploit the parallelism.
Citations: 1
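A conceptual Python analogue of the cross-loop pipeline such a pass exposes (the actual prototype generates OpenMP task constructs from Polly schedules, which this does not model): iteration i of the second loop nest is submitted as soon as iteration i of the first one finishes, rather than after the entire first loop completes. The stage bodies and sizes are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def stage_a(i):
    # First "loop nest": produce a block of data for iteration i.
    return [i * j for j in range(1000)]

def stage_b(i, block):
    # Second "loop nest": consume the block produced by the same iteration.
    return i + sum(block)

with ThreadPoolExecutor(max_workers=4) as pool:
    a_futures = [pool.submit(stage_a, i) for i in range(8)]
    b_futures = []
    for i, f in enumerate(a_futures):
        # Pipeline: stage_b(i) starts as soon as stage_a(i) completes,
        # while later stage_a iterations are still running.
        b_futures.append(pool.submit(stage_b, i, f.result()))
    print([f.result() for f in b_futures])
```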
Frequency Recovery in Power Grids using High-Performance Computing
Vishwas Rao, A. Subramanyam, Michel Schanen, Youngdae Kim, Ignas Šatkauskas, M. Anitescu
DOI: 10.1145/3547276.3548632 | Published: 2022-08-29
Abstract: Maintaining electric power system stability is paramount, especially in extreme contingencies involving unexpected outages of multiple generators or transmission lines that are typical during severe weather events. Such outages often lead to large supply-demand mismatches followed by subsequent system frequency deviations from their nominal value. The extent of frequency deviations is an important metric of system resilience, and its timely mitigation is a central goal of power system operation and control. This paper develops a novel nonlinear model predictive control (NMPC) method to minimize frequency deviations when the grid is affected by an unforeseen loss of multiple components. Our method is based on a novel multi-period alternating current optimal power flow (ACOPF) formulation that accurately models both nonlinear electric power flow physics and the primary and secondary frequency response of generator control mechanisms. We develop a distributed parallel Julia package for solving the large-scale nonlinear optimization problems that result from our NMPC method and thereby address realistic test instances on existing high-performance computing architectures. Our method demonstrates superior performance in terms of frequency recovery over existing industry practices, where generator levels are set based on the solution of single-period classical ACOPF models.
Citations: 0
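A toy, single-period stand-in for the redispatch problem that drives the NMPC above, written with scipy: after a generator outage, the remaining units are re-set within their limits to minimize the supply-demand mismatch that causes frequency deviation. The AC power flow physics, the multi-period horizon, and the primary/secondary frequency-response dynamics of the paper's formulation are all omitted, and the numbers are invented.

```python
import numpy as np
from scipy.optimize import minimize

demand = 10.0                       # per-unit load after the contingency
p_max = np.array([4.0, 4.0, 3.0])   # limits of the remaining generators
p0 = np.array([3.0, 3.0, 2.0])      # pre-contingency set points

def objective(p):
    # Penalize the power imbalance (the driver of frequency deviation)
    # plus a small term that keeps set points near their previous values.
    return (demand - p.sum()) ** 2 + 0.01 * np.sum((p - p0) ** 2)

res = minimize(objective, p0, bounds=[(0.0, pm) for pm in p_max])
print("dispatch:", np.round(res.x, 3), "residual mismatch:", round(demand - res.x.sum(), 4))
```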