Workshop Proceedings of the 51st International Conference on Parallel Processing: Latest Publications

Register-Pressure Aware Predicator for Length Multiplier of RVV
Meng-Shiuan Shih, H.M. Lai, Chao-Lin Lee, Chung-Kai Chen, Jenq-Kuen Lee
{"title":"Register-Pressure Aware Predicator for Length Multiplier of RVV","authors":"Meng-Shiuan Shih, H.M. Lai, Chao-Lin Lee, Chung-Kai Chen, Jenq-Kuen Lee","doi":"10.1145/3547276.3548513","DOIUrl":"https://doi.org/10.1145/3547276.3548513","url":null,"abstract":"The use of parallel processing with vector processors is indispensable. The RISC-V vector extension (RVV) is a highly anticipated extension due to the demand for growing AI applications. The modularity and extensibility make RISC-V a popular instruction set in the industry. Compared to SIMD instruction, vector instructions use fewer instructions with a larger register size which can handle multiple registers within one instruction, resulting in higher performance. With the vector grouping mechanism called vector length multiplier (LMUL) provided by RVV, RVV can combine multiple vector registers into one group so that the processor can increase the throughput of processing data under the same issue rate. However, due to the register pressure, the vector length is not always positively relative to the performance. Therefore, in this paper, we develop an LMUL predicator with register-pressure-aware models to accurately assign the proper LMUL for different programs. The algorithm is based on a priority-based register allocation algorithm and considers the cost of the register pressures and program use patterns. This design helps assign the proper vector length multiplier in compile time for RVV. The experiment result shows that, with a total of 76 vectorization cases of TSVC, the proposed register pressure aware length multiplier achieves 73 correct predictions of the optimal value of Length Multiplier.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130531749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
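The abstract describes the prediction only at a high level; as a minimal sketch of the underlying trade-off, the hypothetical helper below (not the authors' predicator) picks the largest LMUL whose register-group count still covers an estimate of simultaneously live vector values, since grouping with LMUL = m leaves only 32/m usable vector register groups before spills appear.

```cpp
#include <initializer_list>

// Hypothetical sketch: choose the largest LMUL such that the estimated
// number of simultaneously live vector values still fits into the 32
// architectural RVV vector registers once they are grouped.
int predictLMUL(int liveVectorValues) {
    const int kNumVectorRegs = 32;                // v0-v31 in RVV
    for (int lmul : {8, 4, 2, 1}) {               // prefer wider grouping
        int availableGroups = kNumVectorRegs / lmul;
        if (liveVectorValues <= availableGroups)
            return lmul;                          // no spill expected
    }
    return 1;                                     // fall back to LMUL = 1
}
```

The real predicator additionally weighs program use patterns and spill costs from a priority-based register allocator; this sketch only captures the "register pressure caps the useful LMUL" intuition.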
DenMG: Density-Based Member Generation for Ensemble Clustering
Xueqin Du, Yulin He, Philippe Fournier-Viger, J. Huang
{"title":"DenMG: Density-Based Member Generation for Ensemble Clustering","authors":"Xueqin Du, Yulin He, Philippe Fournier-Viger, J. Huang","doi":"10.1145/3547276.3548520","DOIUrl":"https://doi.org/10.1145/3547276.3548520","url":null,"abstract":"Ensemble clustering is a popular approach for identifying clusters in data, which combines results from multiple clustering algorithms to obtain more accurate and robust clusters. However, the performance of ensemble clustering algorithms greatly depends on the quality of its members. Based on this observation, this paper proposes a density-based member generation (DenMG) algorithm that selects ensemble members by considering the distribution consistency. DenMG has two main components, which split sample points from a heterocluster and merge sample points to form a homocluster, respectively. The first component estimates two probability density functions (p.d.f.s) based on an heterocluster’s sample points, and represents them using a Gaussian distribution and a Gaussian mixture model. If random numbers generated by these two p.d.f.s are deemed to have different probability distributions, the heterocluster is split into smaller clusters. The second component merges clusters that have high neighborhood densities into a homocluster. This is done using an opposite-oriented criterion that measures neighborhood density. A series of experiments were conducted to demonstrate the feasibility and effectiveness of the proposed ensemble member generation algorithm. Results show that the proposed algorithm can generate high quality ensemble members and as a result yield better clustering than five state-of-the-art ensemble clustering algorithms.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"238 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133683100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
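The abstract does not name the statistical test used to decide that samples from the two fitted p.d.f.s follow different distributions. As one plausible stand-in, the sketch below uses a two-sample Kolmogorov–Smirnov statistic on draws from the fitted Gaussian and the fitted mixture model; all names and the threshold are illustrative, not taken from the paper.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
// empirical CDFs of two samples (e.g., draws from the fitted Gaussian and
// draws from the fitted Gaussian mixture model).
double ksStatistic(std::vector<double> a, std::vector<double> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    double d = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) ++i; else ++j;
        double fa = static_cast<double>(i) / a.size();
        double fb = static_cast<double>(j) / b.size();
        d = std::max(d, std::fabs(fa - fb));
    }
    return d;
}

// Hypothetical split decision for a heterocluster: split when the draws from
// the two fitted densities look like different distributions. The KS test and
// the 0.1 threshold are assumptions; the paper does not specify its test.
bool shouldSplit(const std::vector<double>& gaussianDraws,
                 const std::vector<double>& gmmDraws,
                 double threshold = 0.1) {
    return ksStatistic(gaussianDraws, gmmDraws) > threshold;
}
```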
Application Showcases for TVM with NeuroPilot on Mobile Devices
Sheng-Yuan Cheng, Chun-Ping Chung, Robert Lai, Jenq-Kuen Lee
{"title":"Application Showcases for TVM with NeuroPilot on Mobile Devices","authors":"Sheng-Yuan Cheng, Chun-Ping Chung, Robert Lai, Jenq-Kuen Lee","doi":"10.1145/3547276.3548514","DOIUrl":"https://doi.org/10.1145/3547276.3548514","url":null,"abstract":"With the increasing demand for machine learning inference on mobile devices, more platforms are emerging to provide AI inferences on mobile devices. One of the popular ones is TVM, which is an end-to-end AI compiler. The major drawback is TVM doesn’t support all manufacturer-supplied accelerators. On the other hand, an AI solution for MediaTek’s platform, NeuroPilot, offers inference on mobile devices with high performance. Nevertheless, NeuroPilot does not support all of the common machine learning frameworks. Therefore, we want to take advantage of both sides. This way, the solution could accept a variety of machine learning frameworks, including Tensorflow, Pytorch, ONNX, and MxNet and utilize the AI accelerator from MediaTek. We adopt the TVM BYOC flow to implement the solution. In order to illustrate the ability to accept different machine learning frameworks for different tasks, we used three different models to build an application showcase in this work: the face anti-spoofing model from PyTorch, the emotion detection model from Keras, and the object detection model from Tflite. Since these models have dependencies while running inference, we propose a prototype of pipeline algorithm to improve the inference performance of the application showcase.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133645450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
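The paper only outlines its pipeline prototype. The sketch below shows one generic way to pipeline three dependent models with threads and blocking queues, so that frame i+1 can enter the first model while frame i is still in the downstream models. The stage functions and the detection → anti-spoofing → emotion ordering are assumptions, not the showcase's actual code.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Minimal blocking queue used to hand frames between pipeline stages.
template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
};

struct Frame { int id; /* image data, detections, ... */ };

// Placeholder "models": in the showcase these would be the TFLite object
// detector, the PyTorch anti-spoofing model, and the Keras emotion model.
Frame detectObjects(Frame f) { return f; }
Frame checkSpoofing(Frame f) { return f; }
Frame detectEmotion(Frame f) { return f; }

int main() {
    Channel<Frame> toSpoof, toEmotion;
    // Stages 2 and 3 run concurrently with stage 1, so several frames are
    // in flight at once even though each frame's models are dependent.
    std::thread spoof([&] {
        while (auto f = toSpoof.pop()) toEmotion.push(checkSpoofing(*f));
        toEmotion.close();
    });
    std::thread emotion([&] {
        while (auto f = toEmotion.pop()) detectEmotion(*f);
    });
    for (int i = 0; i < 8; ++i) toSpoof.push(detectObjects(Frame{i}));
    toSpoof.close();
    spoof.join();
    emotion.join();
    return 0;
}
```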
Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling
T. Boku, N. Fujita, Ryohei Kobayashi, O. Tatebe
{"title":"Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling","authors":"T. Boku, N. Fujita, Ryohei Kobayashi, O. Tatebe","doi":"10.1145/3547276.3548629","DOIUrl":"https://doi.org/10.1145/3547276.3548629","url":null,"abstract":"In this paper, we describe the concept, system architecture, supporting system software, and applications on our world-first supercomputer with multihybrid accelerators using GPU and FPGA coupling, named Cygnus, which runs at Center for Computational Sciences, University of Tsukuba. A special group of 32 nodes is configured as a multihybrid accelerated computing system named Albireo part although Cygnus is constructed with over 80 computation nodes as a GPU-accelerated PC cluster. Each node of the Albireo part is equipped with four NVIDIA V100 GPU cards and two Intel Stratix10 FPGA cards in addition to two sockets of Intel Xeon Gold CPU where all nodes are connected by four lanes of InfiniBand HDR100 interconnection HCA in the full bisection bandwidth of NVIDIA HDR200 switches. Beside this ordinary interconnection network, all FPGA cards in Albireo part are connected by a special 2-Dimensional Torus network with direct optical links on each FPGA for constructing a very high throughput and low latency of FPGA-centric interconnection network. To the best of our knowledge, Cygnus is the world’s first production-level PC cluster to realize multihybrid acceleration with the GPU and FPGA combination. Unlike other GPU-accelerated clusters, users can program parallel codes where each process exploits both or either of the GPU and/or FPGA devices based on the characteristics of their applications. We developed various supporting system software such as inter-FPGA network routing system, DMA engine for GPU-FPGA direct communication managed by FPGA, and multihybrid accelerated programming framework because the programming method of such a complicated system has not been standardized. Further, we developed the first real application on Cygnus for fundamental astrophysics simulation to fully utilize GPU and FPGA together for very efficient acceleration. We describe the overall concept and construction of the Cygnus cluster with a brief introduction of the several underlying hardware and software research studies that have already been published. We summarize how such a concept of GPU/FPGA coworking will usher in a new era of accelerated supercomputing.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133463304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
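As a rough illustration of the programming model described above, where each MPI process drives the GPU, the FPGA, or both, the following skeleton is purely hypothetical and not part of the Cygnus software stack; the device kernels are stubs and the role-assignment rule is an arbitrary example.

```cpp
#include <mpi.h>
#include <cstdio>

// Stubs standing in for application kernels that would run on the node's
// GPUs (bulk parallel work) and FPGAs (tightly coupled, latency-critical work).
void computeOnGpu(int rank)  { std::printf("rank %d: GPU kernel\n", rank); }
void computeOnFpga(int rank) { std::printf("rank %d: FPGA kernel\n", rank); }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Application-dependent choice: some processes additionally offload the
    // latency-critical portion to the FPGA, others use only the GPU.
    bool latencyCritical = (rank % 2 == 0);
    computeOnGpu(rank);
    if (latencyCritical)
        computeOnFpga(rank);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```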
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
Zheming Jin, J. Vetter
{"title":"A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU","authors":"Zheming Jin, J. Vetter","doi":"10.1145/3547276.3548627","DOIUrl":"https://doi.org/10.1145/3547276.3548627","url":null,"abstract":"Integer sum reduction is a primitive operation commonly used in scientific computing. Implementing a parallel reduction on a GPU often involves concurrent memory accesses using atomic operations and synchronization of work-items in a work-group. For a better understanding of these operations, we redesigned micro-kernels in the HIP programming language to measure the time of atomic operations over global memory, the cost of barrier synchronization, and reduction within a work-group to shared local memory using one atomic addition per work-item on a compute unit in an AMD MI100 GPU. Then, we describe the implementations of the reduction kernels with vectorized memory accesses, parameterized workload sizes, and vendor's library APIs. Our experimental results show that 1) there is a performance tradeoff between the cost of barrier synchronization and the amount of parallelism from atomic operations over shared local memory when we increase the size of a work-group. 2) a reduction kernel with vectorized memory accesses and vector data types is approximately 3% faster for the large problem size than the kernels written with the vendor's library APIs. 3) the compiler needs to assist the hardware processor with data dependency resolution at the level of instruction set architecture. 4) the power consumption of the kernel execution on the GPU fluctuates between 277 Watts and 301 Watts and the dynamic power of other GPU activities is at most 31 Watts.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133479207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
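A minimal sketch of the reduction pattern the study measures: one atomic addition per work-item into shared local memory, a barrier, and then a single global atomic per work-group. Kernel name, data, and launch parameters are illustrative, not the authors' micro-kernels.

```cpp
#include <hip/hip_runtime.h>

// One atomic add per work-item into shared local memory (LDS), a barrier,
// then one global atomic per work-group.
__global__ void sumReduce(const int* in, int* out, size_t n) {
    __shared__ int blockSum;
    if (threadIdx.x == 0) blockSum = 0;
    __syncthreads();

    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&blockSum, in[i]);
    __syncthreads();

    if (threadIdx.x == 0) atomicAdd(out, blockSum);
}

int main() {
    const size_t n = 1 << 24;
    int *dIn, *dOut;
    hipMalloc(&dIn, n * sizeof(int));
    hipMalloc(&dOut, sizeof(int));
    hipMemset(dIn, 0, n * sizeof(int));
    hipMemset(dOut, 0, sizeof(int));

    const int wgSize = 256;   // the work-group size is the knob the paper varies
    unsigned numGroups = static_cast<unsigned>((n + wgSize - 1) / wgSize);
    sumReduce<<<numGroups, wgSize>>>(dIn, dOut, n);
    hipDeviceSynchronize();

    int result = 0;
    hipMemcpy(&result, dOut, sizeof(int), hipMemcpyDeviceToHost);
    hipFree(dIn);
    hipFree(dOut);
    return 0;
}
```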
Parallel Beam Search for Combinatorial Optimization
Nikolaus Frohner, Jan Gmys, N. Melab, G. Raidl, E. Talbi
{"title":"Parallel Beam Search for Combinatorial Optimization","authors":"Nikolaus Frohner, Jan Gmys, N. Melab, G. Raidl, E. Talbi","doi":"10.1145/3547276.3548633","DOIUrl":"https://doi.org/10.1145/3547276.3548633","url":null,"abstract":"Inspired by the recent success of parallelized exact methods to solve difficult scheduling problems, we present a general parallel beam search framework for combinatorial optimization problems. Beam search is a constructive metaheuristic traversing a search tree layer by layer while keeping in each layer a bounded number of promising nodes to consider many partial solutions in parallel. We propose a variant which is suitable for intra-node parallelization by multithreading with data parallelism. Diversification and inter-node parallelization are combined by performing multiple randomized runs on independent workers communicating via MPI. For sufficiently large problem instances and beam widths our prototypical implementation in the JIT-compiled Julia language admits speed-ups between 30–42 × on 46 cores with uniform memory access for two difficult classical problems, namely Permutation Flow Shop Scheduling (PFSP) with flowtime objective and the Traveling Tournament Problem (TTP). This allowed us to perform large beam width runs to find 11 new best feasible solutions for 22 difficult TTP benchmark instances up to 20 teams with an average wallclock runtime of about one hour per instance.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115070212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
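The paper's implementation is in Julia; purely to illustrate the layer-by-layer structure and the intra-node data parallelism over beam nodes, here is a C++/OpenMP sketch with a stubbed, problem-specific successor function.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A node in the search tree; cost is the guidance value used to rank
// partial solutions (problem-specific state omitted).
struct Node {
    double cost;
};

// Problem-specific successor function; dummy children here for illustration.
std::vector<Node> expand(const Node& n) {
    return { Node{n.cost + 1.0}, Node{n.cost + 2.0} };
}

// Layer-by-layer beam search keeping at most beamWidth nodes per layer.
// The loop over the current beam is the data-parallel part that the paper
// parallelizes with multithreading inside a compute node.
std::vector<Node> beamSearch(Node root, std::size_t beamWidth, int depth) {
    std::vector<Node> beam{root};
    for (int layer = 0; layer < depth; ++layer) {
        std::vector<std::vector<Node>> perNode(beam.size());
        #pragma omp parallel for schedule(dynamic)
        for (std::size_t i = 0; i < beam.size(); ++i)
            perNode[i] = expand(beam[i]);            // independent expansions

        std::vector<Node> next;
        for (auto& children : perNode)
            next.insert(next.end(), children.begin(), children.end());

        // Keep only the beamWidth most promising partial solutions.
        if (next.size() > beamWidth) {
            std::nth_element(next.begin(), next.begin() + beamWidth, next.end(),
                             [](const Node& a, const Node& b) { return a.cost < b.cost; });
            next.resize(beamWidth);
        }
        beam = std::move(next);
    }
    return beam;
}

int main() {
    std::vector<Node> best = beamSearch(Node{0.0}, /*beamWidth=*/1000, /*depth=*/20);
    return best.empty() ? 1 : 0;
}
```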
The OpenMP Cluster Programming Model
H. Yviquel, M. Pereira, E. Francesquini, G. Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, S. Rigo, Alan Souza, G. Araújo
{"title":"The OpenMP Cluster Programming Model","authors":"H. Yviquel, M. Pereira, E. Francesquini, G. Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, S. Rigo, Alan Souza, G. Araújo","doi":"10.1145/3547276.3548444","DOIUrl":"https://doi.org/10.1145/3547276.3548444","url":null,"abstract":"Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP’s offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128838926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
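OMPC builds on standard OpenMP offloading syntax, so a sketch in plain OpenMP conveys the style: annotated regions become target tasks whose depend clauses express the data-flow graph the runtime can distribute. Under OMPC the "device" would be a remote cluster node rather than a local accelerator; the kernels below are placeholders and not taken from the paper.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20], c[1 << 20];

    #pragma omp parallel
    #pragma omp single
    {
        // Two independent producer tasks; they may run on different nodes.
        #pragma omp target nowait map(from: a[0:n]) depend(out: a)
        for (int i = 0; i < n; ++i) a[i] = i * 0.5f;

        #pragma omp target nowait map(from: b[0:n]) depend(out: b)
        for (int i = 0; i < n; ++i) b[i] = i * 2.0f;

        // Consumer task: runs only after both producers finish, with the
        // MPI-based data movement hidden behind the task dependencies.
        #pragma omp target nowait map(to: a[0:n], b[0:n]) map(from: c[0:n]) \
                depend(in: a, b) depend(out: c)
        for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

        #pragma omp taskwait
    }
    std::printf("c[1] = %f\n", c[1]);
    return 0;
}
```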
Accelerated Computation and Tracking of AC Optimal Power Flow Solutions Using GPUs
Youngdae Kim, Kibaek Kim
{"title":"Accelerated Computation and Tracking of AC Optimal Power Flow Solutions Using GPUs","authors":"Youngdae Kim, Kibaek Kim","doi":"10.1145/3547276.3548631","DOIUrl":"https://doi.org/10.1145/3547276.3548631","url":null,"abstract":"We present a scalable solution method based on an alternating direction method of multipliers and graphics processing units (GPUs) for rapidly computing and tracking a solution of alternating current optimal power flow (ACOPF) problem. Such a fast computation is particularly useful for mitigating the negative impact of frequent load and generation fluctuations on the optimal operation of a large electrical grid. To this end, we decompose a given ACOPF problem by grid components, resulting in a large number of small independent nonlinear nonconvex optimization subproblems. The computation time of these subproblems is significantly accelerated by employing the massive parallel computing capability of GPUs. In addition, the warm-start ability of our method leads to faster convergence, making the method particularly suitable for fast tracking of optimal solutions. We demonstrate the performance of our method on a 70,000 bus system by solving associated optimal power flow problems with both cold start and warm start.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125552189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
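The abstract describes a component-wise ADMM decomposition. The schematic below (coupled variables simplified to scalars, the nonlinear subproblem solver stubbed out) only shows the iteration structure in which the per-component x-updates are embarrassingly parallel and therefore GPU-friendly; it is not the authors' solver.

```cpp
#include <vector>

// Component-wise ADMM schematic: each grid component solves a small
// nonconvex subproblem independently (the part the paper offloads to GPU
// threads), followed by a consensus update and a dual update.
struct Component {
    double x = 0.0;   // local copy of the coupled variable (simplified to 1-D)
    double u = 0.0;   // scaled dual variable
};

// Stub: a real solver minimizes the component's cost plus the augmented-
// Lagrangian penalty (rho/2)*(x - z + u)^2 subject to its own nonlinear
// power-flow constraints. With the penalty term alone, the minimizer is
// x = z - u, independent of rho.
double solveSubproblem(const Component& c, double z, double rho) {
    (void)rho;        // rho matters only once a real cost term is present
    return z - c.u;
}

void admm(std::vector<Component>& comps, double rho, int iters) {
    double z = 0.0;                       // consensus value
    for (int k = 0; k < iters; ++k) {
        // x-update: embarrassingly parallel across components (GPU-friendly).
        for (auto& c : comps) c.x = solveSubproblem(c, z, rho);

        // z-update: consensus step over local copies plus duals.
        double sum = 0.0;
        for (const auto& c : comps) sum += c.x + c.u;
        z = sum / comps.size();

        // dual update.
        for (auto& c : comps) c.u += c.x - z;
    }
}
```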
Workshop Proceedings of the 51st International Conference on Parallel Processing
{"title":"Workshop Proceedings of the 51st International Conference on Parallel Processing","authors":"","doi":"10.1145/3547276","DOIUrl":"https://doi.org/10.1145/3547276","url":null,"abstract":"","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131434284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0