{"title":"Elastic Averaging for Efficient Pipelined DNN Training","authors":"Zihao Chen, Chen Xu, Weining Qian, Aoying Zhou","doi":"10.1145/3572848.3577484","DOIUrl":"https://doi.org/10.1145/3572848.3577484","url":null,"abstract":"Nowadays, the size of DNN models has grown rapidly. To train a large model, pipeline parallelism-based frameworks partition the model across GPUs and slice each batch of data into multiple micro-batches. However, pipeline parallelism suffers from a bubble issue and low peak utilization of GPUs. Recent work tries to address the two issues, but fails to exploit the benefit of vanilla pipeline parallelism, i.e., overlapping communication with computation. In this work, we employ an elastic averaging-based framework which explores elastic averaging to add multiple parallel pipelines. To help the framework exploit the advantage of pipeline parallelism while reducing the memory footprints, we propose a schedule, advance forward propagation. Moreover, since the numbers of parallel pipelines and micro-batches are essential to the framework performance, we propose a profiling-based tuning method to automatically determine the settings. We integrate those techniques into a prototype system, namely AvgPipe, based on PyTorch. Our experiments show that Avg-Pipe achieves a 1.7x speedups over state-of-the-art solutions of pipeline parallelism on average.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"256 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132610761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core","authors":"Shaoshuai Zhang, Ruchi Shah, Hiroyuki Ootomo, Rio Yokota, Panruo Wu","doi":"10.1145/3572848.3577516","DOIUrl":"https://doi.org/10.1145/3572848.3577516","url":null,"abstract":"Symmetric eigenvalue decomposition (EVD) is a fundamental analytic and numerical tool used in many scientific areas. The state-of-the-art algorithm in terms of performance is typically the two-stage tridiagonalization method. The first stage in the two-stage tridiagonalization is called successive band reduction (SBR), which reduces a symmetric matrix to a band form, and its computational cost usually dominates. When Tensor Core (specialized matrix computational accelerator) is used to accelerate the expensive EVD, the conventional ZY-representation-based method results in suboptimal performance due to unfavorable shapes of the matrix computations. In this paper, we propose a new method that uses WY representation instead of ZY representation (see Section 3.2 for details), which can provide a better combination of locality and parallelism so as to perform better on Tensor Cores. Experimentally, the proposed method can bring up to 3.7x speedup in SBR and 2.3x in the entire EVD compared to state-of-the-art implementations.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130037878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TL4x","authors":"Gal Assa, Andreia Correia, Pedro Ramalhete, V. Schiavoni, P. Felber","doi":"10.1145/3572848.3577495","DOIUrl":"https://doi.org/10.1145/3572848.3577495","url":null,"abstract":"The arrival of persistent memory devices to consumer market has revived the interest in transactional durable algorithms. Persistent memory (PM) is touted as having two attributes that distinguish it from other storage technologies: byte-addressability and fast transactional persistence. In this work we investigate how these attributes differentiate PM from block storage in the context of buffered durability. We present a novel algorithm, TL4x, capable of providing buffered durable linearizable transactions with high scalability for disjoint writes and efficient persistence on either PM or block storage devices. TL4x is a software-only user-space solution that optimizes writes to persistent storage, providing buffered durable transactions whose cost is negligible compared to similar non-durable transactions. TL4x maintains a volatile consistent snapshot which is used for buffered durability and shared with irrevocable read-only transactions, allowing long range-query operations to run in parallel with write transactions. We use TL4x to implement a transactional database engine that can outperform RocksDB by an order of magnitude.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122429367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating Fast FFT Kernels on CPUs via FFT-Specific Intrinsics","authors":"Zhihao Li, Haipeng Jia, Yunquan Zhang, Yuyan Sun, Yiwei Zhang, Tun Chen","doi":"10.1145/3572848.3577477","DOIUrl":"https://doi.org/10.1145/3572848.3577477","url":null,"abstract":"This paper proposes an algorithm-specific instruction (ASI)-based fast Fourier transform (FFT) code generation framework, named FFTASI, to generate unified architecture independent butterfly kernels that can be transformed into architecture-dependent kernels by establishing the mapping between ASIs and architecture-specific instructions for various hardware platforms. FFTASI strikes a good balance between performance and productivity on CPUs.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"73 43","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134196625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iQAN","authors":"Zhen Peng, Minjia Zhang, K. Li, R. Jin, Bin Ren","doi":"10.1163/2330-4804_eiro_com_3472","DOIUrl":"https://doi.org/10.1163/2330-4804_eiro_com_3472","url":null,"abstract":"","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116712440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCilk","authors":"T. Schardl, I. Lee","doi":"10.1145/3572848.3577509","DOIUrl":"https://doi.org/10.1145/3572848.3577509","url":null,"abstract":"This paper presents OpenCilk, an open-source software infrastructure for task-parallel programming that allows for substantial code reuse and easy exploration of design choices in language abstraction, compilation strategy, runtime mechanism, and productivity-tool development. The OpenCilk infrastructure consists of three main components: a compiler designed to compile fork-join task-parallel code, an efficient work-stealing runtime scheduler, and a productivity-tool development framework based on compiler instrumentation designed for fork-join parallel computations. OpenCilk is modular --- modifying one component for the most part does not necessitate modifications to the other components --- and easy to extend --- its construction naturally encourages code reuse. Despite being modular and easy to extend, OpenCilk produces high-performing code. We investigated OpenCilk's modularity, extensibility, and performance through several case studies, including a study to extend OpenCilk to support multiple parallel runtime systems, including Cilk Plus, OpenMP, and oneTBB. OpenCilk's design enables rapid prototyping of new compiler back ends to target different parallel-runtime ABIs. Each back end required fewer than 2000 new lines of code. We examined the OpenCilk runtime's performance empirically on 15 benchmark Cilk programs and found that it outperforms the other runtimes by a geometric mean of 4%--26% on 1 core and 10%--120% on 48 cores.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123023705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Throughput GPU Random Walk with Fine-Tuned Concurrent Query Processing","authors":"Cheng Xu, Chao Li, Pengyu Wang, Xiaofeng Hou, Jing Wang, Shixuan Sun, Minyi Guo, Hanqing Wu, Dongbai Chen, Xiang-Yi Liu","doi":"10.1145/3572848.3577482","DOIUrl":"https://doi.org/10.1145/3572848.3577482","url":null,"abstract":"Random walk serves as a powerful tool in dealing with large-scale graphs, reducing data size while preserving structural information. Unfortunately, existing system frameworks all focus on the execution of a single walker task in serial. We propose CoWalker, a high-throughput GPU random walk framework tailored for concurrent random walk tasks. It introduces a multi-level concurrent execution model to allow concurrent random walk tasks to efficiently share GPU resources with low overhead. Our system prototype confirms that the proposed system could outperform (up to 54%) the state-of-the-art in a wide spectral of scenarios.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122874679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness","authors":"Zhen Xie, Jie Liu, Jiajia Li, Dong Li","doi":"10.1145/3572848.3577497","DOIUrl":"https://doi.org/10.1145/3572848.3577497","url":null,"abstract":"The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. Deciding the placement of data objects on HM is critical for high performance. We reveal a performance problem related to data placement on HM. The problem is manifested as load imbalance among tasks in task-parallel HPC applications. The root of the problem comes from being unaware of parallel-task semantics and an incorrect assumption that bringing frequently accessed pages to fast memory always leads to better performance. To address this problem, we introduce a load balance-aware page management system, named Merchandiser. Merchandiser introduces task semantics during memory profiling, rather than being application-agnostic. Using the limited task semantics, Merchandiser effectively sets up coordination among tasks on the usage of HM to finish all tasks fast instead of only considering any individual task. Merchandiser is highly automated to enable high usability. Evaluating with memory-consuming HPC applications, we show that Merchandiser reduces load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123939675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence","authors":"Michael A. Bauer, Elliott Slaughter, Sean Treichler, Wonchan Lee, M. Garland, A. Aiken","doi":"10.1145/3572848.3577515","DOIUrl":"https://doi.org/10.1145/3572848.3577515","url":null,"abstract":"Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127568203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization","authors":"Weihua Zhang, Chuanlei Zhao, Lu Peng, Yuzhe Lin, Fengzhe Zhang, Yunping Lu","doi":"10.1145/3572848.3577474","DOIUrl":"https://doi.org/10.1145/3572848.3577474","url":null,"abstract":"Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate concurrent B+trees performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and variance in response time increasing. Moreover, previous methods do not guarantee linearizability among concurrent requests. In this paper, we design a combined-based concurrency control framework, called Eirene, for GPU B+tree to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of the retry reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128232533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}