{"title":"The Support of MLIR HLS Adaptor for LLVM IR","authors":"Geng-Ming Liang, Chuan-Yue Yuan, Meng-Shiun Yuan, Tai-Liang Chen, Kuan-Hsun Chen, Jenq-Kuen Lee","doi":"10.1145/3547276.3548515","DOIUrl":"https://doi.org/10.1145/3547276.3548515","url":null,"abstract":"Since the emergence of MLIR, High-level Synthesis (HLS) tools have started to be designed with multi-level abstractions. Unlike traditional HLS tools that are based on a single abstraction (e.g., LLVM), optimizations at different levels of abstraction can benefit from cross-layer optimization to achieve better results. Although current HLS tools with MLIR can generate HLS C/C++ for synthesis, we believe that a direct IR transformation from MLIR to LLVM retains more expression details. In this paper, we propose an adaptor for LLVM IR, which optimizes the IR generated from MLIR into HLS-readable IR. Without the gap of unsupported syntax between different versions, developers can focus on their specialization. Our preliminary results show that the MLIR flow via our adaptor generates performance results comparable to those of MLIR HLS tools that generate HLS C++ code. The experiments are performed with Xilinx Vitis and HLS tools.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128581841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structured Concurrency: A Review","authors":"Yi-An Chen, Yi-Ping You","doi":"10.1145/3547276.3548519","DOIUrl":"https://doi.org/10.1145/3547276.3548519","url":null,"abstract":"Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs to ensure a better user experience. With this demand, the ability to manage these concurrent tasks efficiently and easily, especially over their lifetimes, is becoming a new challenge. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support such a paradigm for better concurrency management. It is worth noting that structured concurrency has been consistently implemented on top of coroutines across all these languages and libraries. However, no documents or studies in the literature indicate why and how coroutines are relevant to structured concurrency. In contrast, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of “structure” extends from ordinary programming to concurrent programming. Nevertheless, such a viewpoint does not explain why the concept of structured concurrency emerged more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming started in the 1960s. In this paper, we introduce a new theory to complement the origin of structured concurrency from historical and technical perspectives—it is the foundation established by coroutines that gives birth to structured concurrency.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127440949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps","authors":"Eishi Arima, Minjoon Kang, Issa Saba, J. Weidendorfer, C. Trinitis, Martin Schulz","doi":"10.1145/3547276.3548630","DOIUrl":"https://doi.org/10.1145/3547276.3548630","url":null,"abstract":"CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As a single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become a first-class design constraint for HPC systems, such co-scheduling techniques should be well-tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVIDIA’s MIG (Multi-Instance GPU) partitions a single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, and power capping based on our scalability/interference modeling, while taking a variety of aspects into account, such as compute/memory intensity and utilization of heterogeneous computational resources (e.g., Tensor Cores). Experimental results indicate that our approach successfully selects a near-optimal combination across multiple different workloads.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130408063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing Hierarchical Multi-HCA Aware Allgather in MPI","authors":"Tu Tran, Benjamin Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, D. Panda","doi":"10.1145/3547276.3548524","DOIUrl":"https://doi.org/10.1145/3547276.3548524","url":null,"abstract":"To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a “multi-rail” network. The second- and third-placed systems of the Top500 use two adapters per node; recently, the ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. With such an abundance of networking resources, utilizing all of them is a non-trivial task. The Message Passing Interface (MPI) is a dominant model for high-performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we take up this task and propose hierarchical, multi-HCA-aware Allgather designs; Allgather is a communication-intensive collective widely used in applications like matrix multiplication and in other collectives. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, our new schemes achieve performance improvements for both single-node and multi-node communication. We see inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, for 1024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X. At the application level, the enhanced Allgather shows 1.98x and 1.42x improvement in a matrix-vector multiplication kernel when compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2-X.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134311827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution Flow Aware Profiling for ROS-based Autonomous Vehicle Software","authors":"Shao-Hua Wang, Chia-Heng Tu, C. Huang, J. Juang","doi":"10.1145/3547276.3548516","DOIUrl":"https://doi.org/10.1145/3547276.3548516","url":null,"abstract":"The complexity of Robot Operating System (ROS) based autonomous software grows as autonomous vehicles become more intelligent. It is a big challenge for system designers to rapidly understand the runtime behavior and performance of such sophisticated software, because conventional tools are insufficient for characterizing the high-level interactions of the modules within the software. In this paper, a new graphical representation, the execution flow graph, is devised to represent the execution sequences and related performance statistics of ROS modules. The execution-flow-aware profiling is applied to the autonomous software Autoware and Navigation Stack, with encouraging results.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"664 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116099578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Software/Hardware Co-design Local Irregular Sparsity Method for Accelerating CNNs on FPGA","authors":"Jiangwei Shang, Zhan Zhang, Chuanyou Li, Kun Zhang, Lei Qian, Hongwei Liu","doi":"10.1145/3547276.3548521","DOIUrl":"https://doi.org/10.1145/3547276.3548521","url":null,"abstract":"Convolutional neural networks (CNNs) have been widely used in different areas. The success of CNNs comes with a huge amount of parameters and computations, and nowadays CNNs keep moving toward larger structures. Although larger structures often bring about better inference accuracy, the increasing size also slows the inference speed down. Recently, various parameter sparsity methods have been proposed to accelerate CNNs by reducing the number of parameters and computations. Existing sparsity methods can be classified into two categories: unstructured and structured. Unstructured sparsity methods easily cause irregularity and thus yield suboptimal speedups. On the other hand, structured sparsity methods keep regularity by pruning the parameters following a certain pattern, but result in low sparsity. In this paper, we propose a software/hardware co-design approach to bring local irregular sparsity into CNNs. Benefiting from the local irregularity, we design a row-wise computing engine, RConv Engine, to achieve workload balance and remarkable speedup. The experimental results show that our software/hardware co-design method achieves a 10.9x speedup over state-of-the-art methods with negligible accuracy loss.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115241898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Stage Pre-processing for License Recognition","authors":"J. Zhang, Cheng-Tsung Chan, Minmin Sun","doi":"10.1145/3547276.3548441","DOIUrl":"https://doi.org/10.1145/3547276.3548441","url":null,"abstract":"Various financial insurance and investment application websites require customers to upload identity documents, such as vehicle licenses, to verify their identities. Manual verification of these documents is costly. Hence, there is a clear demand for automatic document recognition. This study proposes a two-stage method to pre-process a vehicle license for better text recognition. In the first stage, the distortion that often appears in photographed documents is repaired. In the second stage, each data field is carefully located. The captured fields are then processed by commercial text recognition software. Due to the sensitivity of vehicle licenses, it is difficult to collect enough data for model training. Consequently, artificial vehicle licenses are synthesized for model training to mitigate overfitting. In addition, an encoder is applied to reduce the background noise, remove the border crossing over text, and make the blurred text clearer before text recognition. Evaluated on a real dataset, the proposed method achieves accuracy close to 90%.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116591890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for low communication approaches for large scale 3D convolution","authors":"Anuva Kulkarni, Jelena Kovacevic, F. Franchetti","doi":"10.1145/3547276.3548626","DOIUrl":"https://doi.org/10.1145/3547276.3548626","url":null,"abstract":"Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to/from memory have not increased proportionally to computational capacity (in terms of FLOPs), 3D FFTs become bounded by communication and are difficult to scale, especially on modern heterogeneous computing platforms consisting of accelerators like GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution that avoids multiple all-to-all exchanges and also optimizes the necessary communication. We provide proof-of-concept results under the assumptions of a use case, the MASSIF Hooke’s law simulation convolution kernel. Our method localizes computation by exploiting properties of the data and approximates the convolution result via data compression, resulting in increased scalability of 3D convolution. Our preliminary results show 8x better scalability than traditional methods on the same compute resources, without adversely affecting result accuracy. Our method can be adapted for first-principle scientific simulations and leverages cross-disciplinary knowledge of the application, the data, and computing to perform large-scale convolution while avoiding communication bottlenecks. To make our approach widely usable and adaptable for emerging challenges, we discuss the use of FFTX, a novel framework which can be used for platform-agnostic specification and optimization of algorithmic approaches similar to ours.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132073449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Pipeline Pattern Detection Technique in Polly","authors":"Delaram Talaashrafi, J. Doerfert, M. M. Maza","doi":"10.1145/3547276.3548445","DOIUrl":"https://doi.org/10.1145/3547276.3548445","url":null,"abstract":"The polyhedral model has repeatedly shown how it facilitates various loop transformations, including loop parallelization, loop tiling, and software pipelining. However, parallelism is almost exclusively exploited on a per-loop basis without much work on detecting cross-loop parallelization opportunities. While many problems can be scheduled such that loop dimensions are dependence-free, the resulting loop parallelism does not necessarily maximize concurrent execution, especially not for unbalanced problems. In this work, we introduce a polyhedral-model-based analysis and scheduling algorithm that exposes and utilizes cross-loop parallelization through tasking. This work exploits pipeline patterns between iterations in different loop nests, and it is well suited to handle imbalanced iterations. Our LLVM/Polly-based prototype performs schedule modifications and code generation targeting a minimal, language-agnostic tasking layer. We present results using an implementation of this API with the OpenMP task construct. For different computation patterns, we achieved speed-ups of up to 3.5x on a quad-core processor, while LLVM/Polly alone fails to exploit the parallelism.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131955309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frequency Recovery in Power Grids using High-Performance Computing","authors":"Vishwas Rao, A. Subramanyam, Michel Schanen, Youngdae Kim, Ignas Šatkauskas, M. Anitescu","doi":"10.1145/3547276.3548632","DOIUrl":"https://doi.org/10.1145/3547276.3548632","url":null,"abstract":"Maintaining electric power system stability is paramount, especially in extreme contingencies involving unexpected outages of multiple generators or transmission lines that are typical during severe weather events. Such outages often lead to large supply-demand mismatches followed by subsequent system frequency deviations from their nominal value. The extent of frequency deviations is an important metric of system resilience, and its timely mitigation is a central goal of power system operation and control. This paper develops a novel nonlinear model predictive control (NMPC) method to minimize frequency deviations when the grid is affected by an unforeseen loss of multiple components. Our method is based on a novel multi-period alternating current optimal power flow (ACOPF) formulation that accurately models both nonlinear electric power flow physics and the primary and secondary frequency response of generator control mechanisms. We develop a distributed parallel Julia package for solving the large-scale nonlinear optimization problems that result from our NMPC method and thereby address realistic test instances on existing high-performance computing architectures. Our method demonstrates superior performance in terms of frequency recovery over existing industry practices, where generator levels are set based on the solution of single-period classical ACOPF models.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127699588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}