{"title":"Elastic Averaging for Efficient Pipelined DNN Training","authors":"Zihao Chen, Chen Xu, Weining Qian, Aoying Zhou","doi":"10.1145/3572848.3577484","DOIUrl":"https://doi.org/10.1145/3572848.3577484","url":null,"abstract":"Nowadays, the size of DNN models has grown rapidly. To train a large model, pipeline parallelism-based frameworks partition the model across GPUs and slice each batch of data into multiple micro-batches. However, pipeline parallelism suffers from a bubble issue and low peak utilization of GPUs. Recent work tries to address the two issues, but fails to exploit the benefit of vanilla pipeline parallelism, i.e., overlapping communication with computation. In this work, we employ an elastic averaging-based framework which explores elastic averaging to add multiple parallel pipelines. To help the framework exploit the advantage of pipeline parallelism while reducing the memory footprints, we propose a schedule, advance forward propagation. Moreover, since the numbers of parallel pipelines and micro-batches are essential to the framework performance, we propose a profiling-based tuning method to automatically determine the settings. We integrate those techniques into a prototype system, namely AvgPipe, based on PyTorch. Our experiments show that Avg-Pipe achieves a 1.7x speedups over state-of-the-art solutions of pipeline parallelism on average.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"256 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132610761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core","authors":"Shaoshuai Zhang, Ruchi Shah, Hiroyuki Ootomo, Rio Yokota, Panruo Wu","doi":"10.1145/3572848.3577516","DOIUrl":"https://doi.org/10.1145/3572848.3577516","url":null,"abstract":"Symmetric eigenvalue decomposition (EVD) is a fundamental analytic and numerical tool used in many scientific areas. The state-of-the-art algorithm in terms of performance is typically the two-stage tridiagonalization method. The first stage in the two-stage tridiagonalization is called successive band reduction (SBR), which reduces a symmetric matrix to a band form, and its computational cost usually dominates. When Tensor Core (specialized matrix computational accelerator) is used to accelerate the expensive EVD, the conventional ZY-representation-based method results in suboptimal performance due to unfavorable shapes of the matrix computations. In this paper, we propose a new method that uses WY representation instead of ZY representation (see Section 3.2 for details), which can provide a better combination of locality and parallelism so as to perform better on Tensor Cores. Experimentally, the proposed method can bring up to 3.7x speedup in SBR and 2.3x in the entire EVD compared to state-of-the-art implementations.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130037878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TL4x","authors":"Gal Assa, Andreia Correia, Pedro Ramalhete, V. Schiavoni, P. Felber","doi":"10.1145/3572848.3577495","DOIUrl":"https://doi.org/10.1145/3572848.3577495","url":null,"abstract":"The arrival of persistent memory devices to consumer market has revived the interest in transactional durable algorithms. Persistent memory (PM) is touted as having two attributes that distinguish it from other storage technologies: byte-addressability and fast transactional persistence. In this work we investigate how these attributes differentiate PM from block storage in the context of buffered durability. We present a novel algorithm, TL4x, capable of providing buffered durable linearizable transactions with high scalability for disjoint writes and efficient persistence on either PM or block storage devices. TL4x is a software-only user-space solution that optimizes writes to persistent storage, providing buffered durable transactions whose cost is negligible compared to similar non-durable transactions. TL4x maintains a volatile consistent snapshot which is used for buffered durability and shared with irrevocable read-only transactions, allowing long range-query operations to run in parallel with write transactions. We use TL4x to implement a transactional database engine that can outperform RocksDB by an order of magnitude.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122429367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating Fast FFT Kernels on CPUs via FFT-Specific Intrinsics","authors":"Zhihao Li, Haipeng Jia, Yunquan Zhang, Yuyan Sun, Yiwei Zhang, Tun Chen","doi":"10.1145/3572848.3577477","DOIUrl":"https://doi.org/10.1145/3572848.3577477","url":null,"abstract":"This paper proposes an algorithm-specific instruction (ASI)-based fast Fourier transform (FFT) code generation framework, named FFTASI, to generate unified architecture independent butterfly kernels that can be transformed into architecture-dependent kernels by establishing the mapping between ASIs and architecture-specific instructions for various hardware platforms. FFTASI strikes a good balance between performance and productivity on CPUs.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"73 43","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134196625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iQAN","authors":"Zhen Peng, Minjia Zhang, K. Li, R. Jin, Bin Ren","doi":"10.1163/2330-4804_eiro_com_3472","DOIUrl":"https://doi.org/10.1163/2330-4804_eiro_com_3472","url":null,"abstract":"","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116712440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCilk","authors":"T. Schardl, I. Lee","doi":"10.1145/3572848.3577509","DOIUrl":"https://doi.org/10.1145/3572848.3577509","url":null,"abstract":"This paper presents OpenCilk, an open-source software infrastructure for task-parallel programming that allows for substantial code reuse and easy exploration of design choices in language abstraction, compilation strategy, runtime mechanism, and productivity-tool development. The OpenCilk infrastructure consists of three main components: a compiler designed to compile fork-join task-parallel code, an efficient work-stealing runtime scheduler, and a productivity-tool development framework based on compiler instrumentation designed for fork-join parallel computations. OpenCilk is modular --- modifying one component for the most part does not necessitate modifications to the other components --- and easy to extend --- its construction naturally encourages code reuse. Despite being modular and easy to extend, OpenCilk produces high-performing code. We investigated OpenCilk's modularity, extensibility, and performance through several case studies, including a study to extend OpenCilk to support multiple parallel runtime systems, including Cilk Plus, OpenMP, and oneTBB. OpenCilk's design enables rapid prototyping of new compiler back ends to target different parallel-runtime ABIs. Each back end required fewer than 2000 new lines of code. We examined the OpenCilk runtime's performance empirically on 15 benchmark Cilk programs and found that it outperforms the other runtimes by a geometric mean of 4%--26% on 1 core and 10%--120% on 48 cores.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123023705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Throughput GPU Random Walk with Fine-Tuned Concurrent Query Processing","authors":"Cheng Xu, Chao Li, Pengyu Wang, Xiaofeng Hou, Jing Wang, Shixuan Sun, Minyi Guo, Hanqing Wu, Dongbai Chen, Xiang-Yi Liu","doi":"10.1145/3572848.3577482","DOIUrl":"https://doi.org/10.1145/3572848.3577482","url":null,"abstract":"Random walk serves as a powerful tool in dealing with large-scale graphs, reducing data size while preserving structural information. Unfortunately, existing system frameworks all focus on the execution of a single walker task in serial. We propose CoWalker, a high-throughput GPU random walk framework tailored for concurrent random walk tasks. It introduces a multi-level concurrent execution model to allow concurrent random walk tasks to efficiently share GPU resources with low overhead. Our system prototype confirms that the proposed system could outperform (up to 54%) the state-of-the-art in a wide spectral of scenarios.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122874679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness","authors":"Zhen Xie, Jie Liu, Jiajia Li, Dong Li","doi":"10.1145/3572848.3577497","DOIUrl":"https://doi.org/10.1145/3572848.3577497","url":null,"abstract":"The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. Deciding the placement of data objects on HM is critical for high performance. We reveal a performance problem related to data placement on HM. The problem is manifested as load imbalance among tasks in task-parallel HPC applications. The root of the problem comes from being unaware of parallel-task semantics and an incorrect assumption that bringing frequently accessed pages to fast memory always leads to better performance. To address this problem, we introduce a load balance-aware page management system, named Merchandiser. Merchandiser introduces task semantics during memory profiling, rather than being application-agnostic. Using the limited task semantics, Merchandiser effectively sets up coordination among tasks on the usage of HM to finish all tasks fast instead of only considering any individual task. Merchandiser is highly automated to enable high usability. Evaluating with memory-consuming HPC applications, we show that Merchandiser reduces load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123939675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence","authors":"Michael A. Bauer, Elliott Slaughter, Sean Treichler, Wonchan Lee, M. Garland, A. Aiken","doi":"10.1145/3572848.3577515","DOIUrl":"https://doi.org/10.1145/3572848.3577515","url":null,"abstract":"Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127568203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization","authors":"Weihua Zhang, Chuanlei Zhao, Lu Peng, Yuzhe Lin, Fengzhe Zhang, Yunping Lu","doi":"10.1145/3572848.3577474","DOIUrl":"https://doi.org/10.1145/3572848.3577474","url":null,"abstract":"Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate concurrent B+trees performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and variance in response time increasing. Moreover, previous methods do not guarantee linearizability among concurrent requests. In this paper, we design a combined-based concurrency control framework, called Eirene, for GPU B+tree to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of the retry reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128232533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}