Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming最新文献

筛选
英文 中文
DisCVar: discovering critical variables using algorithmic differentiation for transient faults DisCVar:利用算法微分发现瞬态故障的关键变量
Harshitha Menon, K. Mohror
{"title":"DisCVar: discovering critical variables using algorithmic differentiation for transient faults","authors":"Harshitha Menon, K. Mohror","doi":"10.1145/3178487.3178502","DOIUrl":"https://doi.org/10.1145/3178487.3178502","url":null,"abstract":"Aggressive technology scaling trends have made the hardware of high performance computing (HPC) systems more susceptible to faults. Some of these faults can lead to silent data corruption (SDC), and represent a serious problem because they alter the HPC simulation results. In this paper, we present a full-coverage, systematic methodology called DisCVar to identify critical variables in HPC applications for protection against SDC. DisCVar uses automatic differentiation (AD) to determine the sensitivity of the simulation output to errors in program variables. We empirically validate our approach in identifying vulnerable variables by comparing the results against a full-coverage code-level fault injection campaign. We find that our DisCVar correctly identifies the variables that are critical to ensure application SDC resilience with a high degree of accuracy compared to the results of the fault injection campaign. Additionally, DisCVar requires only two executions of the target program to generate results, whereas in our experiments we needed to perform millions of executions to get the same information from a fault injection campaign.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"663 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134251818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Reducing transaction aborts by looking to the future 通过展望未来来减少事务中止
Nachshon Cohen, E. Petrank, J. Larus
{"title":"Reducing transaction aborts by looking to the future","authors":"Nachshon Cohen, E. Petrank, J. Larus","doi":"10.1145/3178487.3178518","DOIUrl":"https://doi.org/10.1145/3178487.3178518","url":null,"abstract":"Transactions are widely used in database engines and they becoming increasingly useful as a general synchronization technique for multicore machines [1]. Transactional systems allow a programmer to encapsulate multiple operations inside a transaction. All these operations appear to be executed atomically or not at all.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"220 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133748752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Shared-memory parallelization of MTTKRP for dense tensors 稠密张量MTTKRP的共享内存并行化
Koby Hayashi, Grey Ballard, Yujie Jiang, Michael J. Tobia
{"title":"Shared-memory parallelization of MTTKRP for dense tensors","authors":"Koby Hayashi, Grey Ballard, Yujie Jiang, Michael J. Tobia","doi":"10.1145/3178487.3178522","DOIUrl":"https://doi.org/10.1145/3178487.3178522","url":null,"abstract":"The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this work, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4X over existing parallel software.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124152539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
Lazygraph: lazy data coherency for replicas in distributed graph-parallel computation 懒图:分布式图并行计算中副本的懒数据一致性
Lei Wang, Liangji Zhuang, Junhang Chen, Huimin Cui, Fang Lv, Y. Liu, Xiaobing Feng
{"title":"Lazygraph: lazy data coherency for replicas in distributed graph-parallel computation","authors":"Lei Wang, Liangji Zhuang, Junhang Chen, Huimin Cui, Fang Lv, Y. Liu, Xiaobing Feng","doi":"10.1145/3178487.3178508","DOIUrl":"https://doi.org/10.1145/3178487.3178508","url":null,"abstract":"Replicas 1 of a vertex play an important role in existing distributed graph processing systems which make a single vertex to be parallel processed by multiple machines and access remote neighbors locally without any remote access. However, replicas of vertices introduce data coherency problem. Existing distributed graph systems treat replicas of a vertex v as an atomic and indivisible vertex, and use an eager data coherency approach to guarantee replicas atomicity. In eager data coherency approach, any changes to vertex data must be immediately communicated to all replicas of v, thus leading to frequent global synchronizations and communications. In this paper, we propose a lazy data coherency approach, called LazyAsync, which treats replicas of a vertex as independent vertices and maintains the data coherency by computations, rather than communications in existing eager approach. Our approach automatically selects some data coherency points from the graph algorithm, and maintains all replicas to share the same global view only at such points, which means the replicas are enabled to maintain different local views between any two adjacent data coherency points. Based on PowerGraph, we develop a distributed graph processing system LazyGraph to implement the LazyAsync approach and exploit graph-aware optimizations. On a 48-node EC2-like cluster, LazyGraph outperforms PowerGraph on four widely used graph algorithms across a variety of real-world graphs, with a speedup ranging from 1.25x to 10.69x.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131101522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Practical concurrent traversals in search trees 搜索树中的实际并发遍历
Dana Drachsler-Cohen, Martin T. Vechev, Eran Yahav
{"title":"Practical concurrent traversals in search trees","authors":"Dana Drachsler-Cohen, Martin T. Vechev, Eran Yahav","doi":"10.1145/3178487.3178503","DOIUrl":"https://doi.org/10.1145/3178487.3178503","url":null,"abstract":"Operations of concurrent objects often employ optimistic concurrency-control schemes that consist of a traversal followed by a validation step. The validation checks if concurrent mutations interfered with the traversal to determine if the operation should proceed or restart. A fundamental challenge is to discover a necessary and sufficient validation check that has to be performed to guarantee correctness. In this paper, we show a necessary and sufficient condition for validating traversals in search trees. The condition relies on a new concept of succinct path snapshots, which are derived from and embedded in the structure of the tree. We leverage the condition to design a general lock-free membership test suitable for any search tree. We then show how to integrate the validation condition in update operations of (non-rebalancing) binary search trees, internal and external, and AVL trees. We experimentally show that our new algorithms outperform existing ones.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125133138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Automated code acceleration targeting heterogeneous openCL devices 针对异构openCL设备的自动代码加速
Heinrich Riebler, G. Vaz, Tobias Kenter, Christian Plessl
{"title":"Automated code acceleration targeting heterogeneous openCL devices","authors":"Heinrich Riebler, G. Vaz, Tobias Kenter, Christian Plessl","doi":"10.1145/3178487.3178534","DOIUrl":"https://doi.org/10.1145/3178487.3178534","url":null,"abstract":"Accelerators can offer exceptional performance advantages. However, programmers need to spend considerable efforts on acceleration, without knowing how sustainable the employed programming models, languages and tools are. To tackle this challenge, we propose and demonstrate a new runtime system called HTrOP that is able to automatically generate and execute OpenCL code from sequential CPU code. HTrOP transforms suitable data-parallel loops into independent OpenCL-typical work-items and handles concrete calls to these devices through a mix of library components and application-specific OpenCL host code. Computational hotspots are identified and can be offloaded to different resources (CPU, GPGPU and Xeon Phi). We demonstrate the potential of HTrOP on a broad set of applications and are able to improve the performance by 4.3X on average.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122212613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Making pull-based graph processing performant 使基于拉的图形处理性能
Samuel Grossman, Heiner Litz, C. Kozyrakis
{"title":"Making pull-based graph processing performant","authors":"Samuel Grossman, Heiner Litz, C. Kozyrakis","doi":"10.1145/3178487.3178506","DOIUrl":"https://doi.org/10.1145/3178487.3178506","url":null,"abstract":"Graph processing engines following either the push-based or pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth utilization. Outer loop parallelization is simple for both engine types but suffers from high load imbalance. This work focuses on inner loop parallelization for pull engines, which when performed naively leads to a significant increase in conflicting memory writes that must be synchronized. Our first contribution is a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations. This eliminates most write traffic and avoids all synchronization, leading to speedups of up to 50X. Our second contribution is the Vector-Sparse format, which addresses the obstacles to vectorization that stem from the commonly-used Compressed-Sparse data structure. Our new format eliminates unaligned memory accesses and bounds checks within vector operations, two common problems when processing low-degree vertices. Vectorization with Vector-Sparse leads to speedups of up to 2.5X. Our contributions are embodied in Grazelle, a hybrid graph processing framework. On a server equipped with four Intel Xeon E7-4850 v3 processors, Grazelle respectively outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2X, 4.6X, 4.7X, and 66.8X.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130354390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 61
A microbenchmark to study GPU performance models 一个研究GPU性能模型的微基准
V. Volkov
{"title":"A microbenchmark to study GPU performance models","authors":"V. Volkov","doi":"10.1145/3178487.3178536","DOIUrl":"https://doi.org/10.1145/3178487.3178536","url":null,"abstract":"Basic microarchitectural features of NVIDIA GPUs have been stable for a decade, and many analytic solutions were proposed to model their performance. We present a way to review, systematize, and evaluate these approaches by using a microbenchmark. In this manner, we produce a brief algebraic summary of key elements of selected performance models, identify patterns in their design, and highlight their previously unknown limitations. Also, we identify a potentially superior method for estimating performance based on classical work.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134499721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
VerifiedFT: a verified, high-performance precise dynamic race detector VerifiedFT:一个经过验证的高性能精确动态竞赛检测器
James R. Wilcox, C. Flanagan, Stephen N. Freund
{"title":"VerifiedFT: a verified, high-performance precise dynamic race detector","authors":"James R. Wilcox, C. Flanagan, Stephen N. Freund","doi":"10.1145/3178487.3178514","DOIUrl":"https://doi.org/10.1145/3178487.3178514","url":null,"abstract":"Dynamic data race detectors are valuable tools for testing and validating concurrent software, but to achieve good performance they are typically implemented using sophisticated concurrent algorithms. Thus, they are ironically prone to the exact same kind of concurrency bugs they are designed to detect. To address these problems, we have developed VerifiedFT, a clean slate redesign of the FastTrack race detector [19]. The VerifiedFT analysis provides the same precision guarantee as FastTrack, but is simpler to implement correctly and efficiently, enabling us to mechanically verify an implementation of its core algorithm using CIVL [27]. Moreover, VerifiedFT provides these correctness guarantees without sacrificing any performance over current state-of-the-art (but complex and unverified) FastTrack implementations for Java.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123048820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
Graph partitioning applied to DAG scheduling to reduce NUMA effects 图分区应用于DAG调度,减少NUMA效应
Isaac Sánchez Barrera, Marc Casas, Miquel Moretó, E. Ayguadé, Jesús Labarta, M. Valero
{"title":"Graph partitioning applied to DAG scheduling to reduce NUMA effects","authors":"Isaac Sánchez Barrera, Marc Casas, Miquel Moretó, E. Ayguadé, Jesús Labarta, M. Valero","doi":"10.1145/3178487.3178535","DOIUrl":"https://doi.org/10.1145/3178487.3178535","url":null,"abstract":"The complexity of shared memory systems is becoming more relevant as the number of memory domains increases, with different access latencies and bandwidth rates depending on the proximity between the cores and the devices containing the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are typically applied by the system software. We propose techniques at the runtime system level to reduce NUMA effects on parallel applications. We leverage runtime system metadata in terms of a task dependency graph. Our approach, based on graph partitioning methods, is able to provide parallel performance improvements of 1.12X on average with respect to the state-of-the-art.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115516892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信