{"title":"Revealing parallel scans and reductions in sequential loops through function reconstruction","authors":"Peng Jiang, G. Agrawal","doi":"10.1145/3178487.3178523","DOIUrl":"https://doi.org/10.1145/3178487.3178523","url":null,"abstract":"Many sequential loops are actually scans or reductions and can be parallelized across iterations despite the loop-carried dependences. In this work, we consider the parallelization of such scan/reduction loops, and propose a practical runtime approach called sampling-and-reconstruction to extract the hidden scan/reduction patterns in these loops.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130012961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging the gap between deep learning and sparse matrix format selection","authors":"Yue Zhao, Jiajia Li, C. Liao, Xipeng Shen","doi":"10.1145/3178487.3178495","DOIUrl":"https://doi.org/10.1145/3178487.3178495","url":null,"abstract":"This work presents a systematic exploration on the promise and special challenges of deep learning for sparse matrix format selection---a problem of determining the best storage format for a matrix to maximize the performance of Sparse Matrix Vector Multiplication (SpMV). It describes how to effectively bridge the gap between deep learning and the special needs of the pillar HPC problem through a set of techniques on matrix representations, deep learning structure, and cross-architecture model migrations. The new solution cuts format selection errors by two thirds, and improves SpMV performance by 1.73X on average over the state of the art.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126221030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient parallel determinacy race detection for two-dimensional dags","authors":"Yifan Xu, I. Lee, Kunal Agrawal","doi":"10.1145/3178487.3178515","DOIUrl":"https://doi.org/10.1145/3178487.3178515","url":null,"abstract":"A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program since they lead to non-deterministic program behavior --- different schedules of the program can lead to different results. Most prior work on detecting these races focuses on a subclass of programs with fork-join parallelism. This paper presents a race-detection algorithm, 2D-Order, for detecting races in a more general class of programs, namely programs whose dependence structure can be represented as planar dags embedded in 2D grids. Such dependence structures arise from programs that use pipelined parallelism or dynamic programming recurrences. Given a computation with T1 work and T∞ span, 2D-Order executes it while also detecting races in O(T1/P + T∞) time on P processors, which is asymptotically optimal. We also implemented PRacer, a race-detection algorithm based on 2D-Order for Cilk-P, which is a language for expressing pipeline parallelism. Empirical results demonstrate that PRacer incurs reasonable overhead and exhibits scalability similar to the baseline (executions without race detection) when running on multiple cores.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128275969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A predictable synchronisation algorithm","authors":"S. Reif, Wolfgang Schröder-Preikschat","doi":"10.1145/3178487.3178533","DOIUrl":"https://doi.org/10.1145/3178487.3178533","url":null,"abstract":"Interaction with physical objects often imposes latency requirements to multi-core embedded systems. One consequence is the need for synchronisation algorithms that provide predictable latency, in addition to high throughput. We present a synchronisation algorithm that needs at most 7 atomic memory operations per asynchronous critical section. The performance is competitive, at least, to locks.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129158137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures","authors":"Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu","doi":"10.1145/3178487.3178513","DOIUrl":"https://doi.org/10.1145/3178487.3178513","url":null,"abstract":"Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work. In this paper, we propose a novel data layout called Sparse Level Tile to make all data reuse under control, and design a Producer-Consumer pairing method to make any inter-level synchronization only happen in very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building-block of the current world fastest supercomputer Sunway Taihulight. The experimental results of testing all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and the best speedup of 38.5 over parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor in 1856 matrices and the latest methods on a K80 GPU in 1672 matrices, respectively.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131969433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SecureMR: secure mapreduce using homomorphic encryption and program partitioning","authors":"Yao Dong, Ana L. Milanova, Julian T Dolby","doi":"10.1145/3178487.3178520","DOIUrl":"https://doi.org/10.1145/3178487.3178520","url":null,"abstract":"In cloud computing customers upload data and computation to cloud providers. As they cede their data to the cloud provider, they may cede data confidentiality. We develop SecureMR, a system that analyzes and transforms MapReduce programs to operate over encrypted data. SecureMR makes use of partially homomorphic encryption and a trusted client. We evaluate SecureMR on a set of MapReduce benchmarks.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131586432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, P. Rawat, S. Krishnamoorthy, L. Pouchet, F. Rastello, P. Sadayappan
{"title":"Performance modeling for GPUs using abstract kernel emulation","authors":"Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, P. Rawat, S. Krishnamoorthy, L. Pouchet, F. Rastello, P. Sadayappan","doi":"10.1145/3178487.3178524","DOIUrl":"https://doi.org/10.1145/3178487.3178524","url":null,"abstract":"Performance modeling of GPU kernels is a significant challenge. In this paper, we develop a novel approach to performance modeling for GPUs through abstract kernel emulation along with latency/gap modeling of resources. Experimental results on all benchmarks from the Rodinia suite demonstrate good accuracy in predicting execution time on multiple GPU platforms.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121237667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Rawat, F. Rastello, Aravind Sukumaran-Rajam, L. Pouchet, A. Rountev, P. Sadayappan
{"title":"Register optimizations for stencils on GPUs","authors":"P. Rawat, F. Rastello, Aravind Sukumaran-Rajam, L. Pouchet, A. Rountev, P. Sadayappan","doi":"10.1145/3178487.3178500","DOIUrl":"https://doi.org/10.1145/3178487.3178500","url":null,"abstract":"The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125786327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vSensor: leveraging fixed-workload snippets of programs for performance variance detection","authors":"Xiongchao Tang, Jidong Zhai, Xuehai Qian, Bingsheng He, Wei Xue, Wenguang Chen","doi":"10.1145/3178487.3178497","DOIUrl":"https://doi.org/10.1145/3178487.3178497","url":null,"abstract":"Performance variance becomes increasingly challenging on current large-scale HPC systems. Even using a fixed number of computing nodes, the execution time of several runs can vary significantly. Many parallel programs executing on supercomputers suffer from such variance. Performance variance not only causes unpredictable performance requirement violations, but also makes it unintuitive to understand the program behavior. Despite prior efforts, efficient on-line detection of performance variance remains an open problem. In this paper, we propose vSensor, a novel approach for light-weight and on-line performance variance detection. The key insight is that, instead of solely relying on an external detector, the source code of a program itself could reveal the runtime performance characteristics. Specifically, many parallel programs contain code snippets that are executed repeatedly with an invariant quantity of work. Based on this observation, we use compiler techniques to automatically identify these fixed-workload snippets and use them as performance variance sensors (v-sensors) that enable effective detection. We evaluate vSensor with a variety of parallel programs on the Tianhe-2 system. Results show that vSensor can effectively detect performance variance on HPC systems. The performance overhead is smaller than 4% with up to 16,384 processes. In particular, with vSensor, we found a bad node with slow memory that slowed a program's performance by 21%. As a showcase, we also detected a severe network performance problem that caused a 3.37X slowdown for an HPC kernel program on the Tianhe-2 system.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127710820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Harnessing epoch-based reclamation for efficient range queries","authors":"Maya Arbel-Raviv, Trevor Brown","doi":"10.1145/3178487.3178489","DOIUrl":"https://doi.org/10.1145/3178487.3178489","url":null,"abstract":"Concurrent sets with range query operations are highly desirable in applications such as in-memory databases. However, few set implementations offer range queries. Known techniques for augmenting data structures with range queries (or operations that can be used to build range queries) have numerous problems that limit their usefulness. For example, they impose high overhead or rely heavily on garbage collection. In this work, we show how to augment data structures with highly efficient range queries, without relying on garbage collection. We identify a property of epoch-based memory reclamation algorithms that makes them ideal for implementing range queries, and produce three algorithms, which use locks, transactional memory and lock-free techniques, respectively. Our algorithms are applicable to more data structures than previous work, and are shown to be highly efficient on a large scale Intel system.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"28 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132532372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}