{"title":"Learning to Parallelize in a Shared-Memory Environment with Transformers","authors":"Re'em Harel, Yuval Pinter, Gal Oren","doi":"10.1145/3572848.3582565","DOIUrl":"https://doi.org/10.1145/3572848.3582565","url":null,"abstract":"In past years, the world has switched to multi and many core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes, such as OpenMP, to applications. Nevertheless, introducing OpenMP work-sharing loop construct into code, especially legacy code, is challenging due to pervasive pitfalls in management of parallel shared memory. To facilitate the performance of this task, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in machine learning techniques, specifically in natural language processing (NLP), to suggest the need for an OpenMP work-sharing loop directive and data-sharing attributes clauses --- the building blocks of concurrent programming. We train several transformer models, named PragFormer, for these tasks and show that they outperform statistically-trained baselines and automatic source-to-source (S2S) parallelization compilers in both classifying the overall need for an parallel for directive and the introduction of private and reduction clauses. In the future, our corpus can be used for additional tasks, up to generating entire OpenMP directives. The source code and database for our project can be accessed on GitHub 1 and HuggingFace 2.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114316358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing","authors":"Rati Gelashvili, A. Spiegelman, Zhuolun Xiang, G. Danezis, Zekun Li, Yu Xia, Runtian Zhou, D. Malkhi","doi":"10.1145/3572848.3577524","DOIUrl":"https://doi.org/10.1145/3572848.3577524","url":null,"abstract":"Block-STM is a parallel execution engine for smart contracts, built around the principles of Software Transactional Memory. Transactions are grouped in blocks, and every execution of the block must yield the same deterministic outcome. Block-STM further enforces that the outcome is consistent with executing transactions according to a preset order, leveraging this order to dynamically detect dependencies and avoid conflicts during speculative transaction execution. At the core of Block-STM is a novel, low-overhead collaborative scheduler of execution and validation tasks. Block-STM is implemented on the main branch of the Diem Blockchain code-base and runs in production at Aptos. Our evaluation demonstrates that Block-STM is adaptive to workloads with different conflict rates and utilizes the inherent parallelism therein. Block-STM achieves up to 110k tps in the Diem benchmarks and up to 170k tps in the Aptos Benchmarks, which is a 20x and 17x improvement over the sequential baseline with 32 threads, respectively. The throughput on a contended workload is up to 50k tps and 80k tps in Diem and Aptos benchmarks, respectively.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121348781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism","authors":"Zhaodong Chen, Yuying Quan, Zheng Qu, L. Liu, Yufei Ding, Yuan Xie","doi":"10.1145/3572848.3577500","DOIUrl":"https://doi.org/10.1145/3572848.3577500","url":null,"abstract":"Transformers are becoming the mainstream solutions for various tasks like NLP and Computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. One opportunity to accelerate the attention mechanism is leveraging the sparsity in the attention weight matrix. However, due to the dilemma between \"dynamic\" and \"fine-grained\", previous studies fail to achieve speedup on GPUs under moderate sequence lengths. They also require costly retraining to recover accuracy. In this paper, we present DFSS, the first GPU-friendly dynamic fine-grained pruning mechanism, to address this dilemma. DFSS dynamically prunes the full attention score matrix to N:M fine-grained structured sparse pattern. Our key insight is that on the dynamic side, N:M sparsity is friendly to pruning and encoding the sparse matrix on GPU. On the fine-grained side, it always preserves the dominant entries in each row. We develop a dynamic sampled dense-dense matrix multiplication kernel, first of its kind, that multiplies the query and key matrices, prunes the result, and encodes the compressed sparse matrix without overhead. Compared with previous studies, DFSS achieves speedup in arbitrary sequence lengths. It only takes a few fine-tuning epochs to reach on-par accuracy with full attention mechanism. We provide both theoretical and empirical evidence to demonstrate DFSS is a good approximation of the full attention mechanism. We evaluate the 1:2 and 2:4 sparsity under different settings and achieve 1.38 ~ 1.86× speedups over the full-attention on A100 GPU. On tasks from various domains with sequence lengths from 384 to 4096, its accuracy is on par with the full attention after only a couple of finetuning epochs from the dense pre-trained model.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133209356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","authors":"","doi":"10.1145/3572848","DOIUrl":"https://doi.org/10.1145/3572848","url":null,"abstract":"","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123184726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}