Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: Latest Publications

A scalable distance-1 vertex coloring algorithm for power-law graphs
J. Firoz, Marcin Zalewski, A. Lumsdaine
DOI: 10.1145/3178487.3178521 · Published: 2018-02-10
Abstract: We propose a distributed, unordered, label-correcting distance-1 vertex coloring algorithm, called the Distributed Control (DC) coloring algorithm. DC eliminates the need for vertex-centric barriers and global synchronization for color refinement, relying only on atomic operations and local termination detection to update vertex colors. We implement our DC coloring algorithm and the well-known Jones-Plassmann algorithm in the AM++ AMT runtime and compare their performance. We show that, with runtime support, eliminating the waiting time of vertex-centric barriers and investing that time in local ordering yields better execution times for power-law graphs with dense local subgraphs.
Citations: 1
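As context for readers unfamiliar with the problem, a distance-1 coloring assigns colors so that no two adjacent vertices share one. The minimal sequential greedy sketch below (illustrative names, not the paper's distributed DC algorithm) shows the invariant that the distributed algorithm maintains while refining colors asynchronously.

```cpp
#include <vector>

// Minimal sequential sketch of greedy distance-1 coloring (not the paper's
// distributed DC algorithm): each vertex receives the smallest color not
// already used by any of its neighbors, so adjacent vertices never share one.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<char> used(n, 0);              // scratch: colors seen at v
    for (int v = 0; v < n; ++v) {
        for (int u : adj[v])
            if (color[u] >= 0) used[color[u]] = 1;
        int c = 0;
        while (used[c]) ++c;                   // smallest free color
        color[v] = c;
        for (int u : adj[v])                   // reset scratch for next vertex
            if (color[u] >= 0) used[color[u]] = 0;
    }
    return color;
}
```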
Strong trylocks for reader-writer locks
Andreia Correia, P. Ramalhete
DOI: 10.1145/3178487.3178519 · Published: 2018-02-10
Abstract: A reader-writer lock provides basic methods for shared and exclusive lock acquisition. A thread calling one of these methods may have to wait indefinitely to enter its critical section, with no guarantee of completion. We present two new reader-writer strong trylock algorithms, where a call to a trylock method always completes in a finite number of steps and is guaranteed to succeed unless there is a linearizable history in which another thread holds the lock. The first algorithm, named StrongTryRW, uses a single word of memory to reach consensus, thus yielding reduced scalability for readers. To address read scalability, we designed StrongTryRWRI, which matches the throughput of current state-of-the-art reader-writer lock algorithms.
Citations: 4
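For context, the single-word design point that the paper identifies as limiting reader scalability can be sketched as follows; this toy trylock offers only best-effort semantics under contention, not the strong guarantee proved for StrongTryRW and StrongTryRWRI.

```cpp
#include <atomic>
#include <cstdint>

// Minimal sketch (not the paper's StrongTryRW): a reader-writer trylock
// packed into one atomic word. Bit 31 marks a writer; low bits count readers.
class RWTryLock {
    static constexpr uint32_t WRITER = 1u << 31;
    std::atomic<uint32_t> state{0};
public:
    bool try_lock_shared() {
        uint32_t s = state.load(std::memory_order_relaxed);
        // Fail as soon as a writer is observed; otherwise keep trying to
        // register as a reader.
        while (!(s & WRITER)) {
            if (state.compare_exchange_weak(s, s + 1,
                                            std::memory_order_acquire))
                return true;
        }
        return false;
    }
    void unlock_shared() { state.fetch_sub(1, std::memory_order_release); }

    bool try_lock() {
        uint32_t expected = 0;   // succeeds only if no readers and no writer
        return state.compare_exchange_strong(expected, WRITER,
                                             std::memory_order_acquire);
    }
    void unlock() { state.fetch_and(~WRITER, std::memory_order_release); }
};
```

Because every reader must CAS the same word, read-side throughput collapses under contention, which is exactly the motivation for the second algorithm in the paper.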
Communication-avoiding parallel minimum cuts and connected components
Lukas Gianinazzi, Pavel Kalvoda, A. Palma, Maciej Besta, T. Hoefler
DOI: 10.1145/3178487.3178504 · Published: 2018-02-10
Abstract: We present novel scalable parallel algorithms for finding global minimum cuts and connected components, which are important and fundamental problems in graph processing. To take advantage of future massively parallel architectures, our algorithms are communication-avoiding: they reduce the costs of communication across the network and the cache hierarchy. The fundamental technique underlying our work is the randomized sparsification of a graph: removing a fraction of graph edges, deriving a solution for the sparsified graph, and using the result to obtain a solution for the original input. We design and implement sparsification with O(1) synchronization steps. Our global minimum cut algorithm decreases communication costs and computation compared to the state of the art, while our connected components algorithm incurs few cache misses and synchronization steps. We validate our approach by evaluating MPI implementations of the algorithms on a petascale supercomputer. We also provide an approximate variant of the minimum cut algorithm and show that it approximates the exact solutions well while using a fraction of the cores in a fraction of the time.
Citations: 33
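For reference, the sequential formulation of connected components that the paper's communication-avoiding algorithm accelerates can be expressed with a union-find structure; the sketch below is the baseline building block only, not the sparsification-based parallel algorithm.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Sequential union-find baseline for connected components (the problem the
// paper parallelizes); path halving plus union by size keep it near-linear.
struct UnionFind {
    std::vector<int> parent, size;
    explicit UnionFind(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];    // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a;
        size[a] += size[b];
    }
};

// Label every vertex with a representative of its connected component.
std::vector<int> connected_components(int n,
        const std::vector<std::pair<int, int>>& edges) {
    UnionFind uf(n);
    for (auto [u, v] : edges) uf.unite(u, v);
    std::vector<int> label(n);
    for (int v = 0; v < n; ++v) label[v] = uf.find(v);
    return label;
}
```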
Optimizing N-dimensional, Winograd-based convolution for manycore CPUs
Zhen Jia, A. Zlateski, F. Durand, Kai Li
DOI: 10.1145/3178487.3178496 · Published: 2018-02-10
Abstract: Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They achieve only slightly better, and often worse, performance than well-optimized direct convolution implementations. We propose and implement an algorithm for N-dimensional Winograd-based convolution that allows arbitrary kernel sizes and is optimized for manycore CPUs. Our algorithm achieves high hardware utilization through a series of optimizations. Our experiments show that on modern ConvNets, our optimized implementation is on average more than 3x, and sometimes 8x, faster than other state-of-the-art CPU implementations on Intel Xeon Phi manycore processors. Moreover, our implementation on the Xeon Phi achieves competitive performance for 2D ConvNets and superior performance for 3D ConvNets, compared with the best GPU implementations.
Citations: 45
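The arithmetic saving that the paper generalizes to N dimensions and arbitrary kernel sizes comes from Winograd's minimal filtering algorithms. The smallest instance, F(2,3), produces two outputs of a 1-D convolution with a 3-tap kernel using 4 multiplications instead of 6, as the standard textbook transform below illustrates (not the paper's manycore implementation).

```cpp
#include <array>

// Winograd minimal filtering F(2,3): two 1-D convolution outputs from a
// 3-tap kernel g applied to a 4-element input tile d, using 4 multiplies
// instead of the 6 needed by direct convolution.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    const float m1 = (d[0] - d[2]) * g[0];
    const float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    const float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    const float m4 = (d[1] - d[3]) * g[2];
    return { m1 + m2 + m3,          // = d0*g0 + d1*g1 + d2*g2
             m2 - m3 - m4 };        // = d1*g0 + d2*g1 + d3*g2
}
```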
Register-based implementation of the sparse general matrix-matrix multiplication on GPUs
Junhong Liu, Xinchao He, Weifeng Liu, Guangming Tan
DOI: 10.1145/3178487.3178529 · Published: 2018-02-10
Abstract: General sparse matrix-matrix multiplication (SpGEMM) is an essential building block in a number of applications. In our work, we fully utilize GPU registers and shared memory to implement an efficient and load-balanced SpGEMM in comparison with the existing implementations.
Citations: 13
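As background, the row-by-row (Gustavson) formulation of SpGEMM over CSR matrices is sketched below with a hash-map accumulator; the paper's contribution lies in replacing such accumulators with GPU registers and shared memory and in balancing the load across threads, which this CPU baseline does not attempt.

```cpp
#include <unordered_map>
#include <vector>

// CSR sparse matrix: row_ptr has nrows+1 entries; col/val hold the nonzeros.
struct CSR {
    int nrows = 0, ncols = 0;
    std::vector<int> row_ptr, col;
    std::vector<double> val;
};

// Row-by-row (Gustavson) SpGEMM sketch: C = A * B, accumulating each output
// row in a hash map. Illustrative baseline only, not the register-based GPU
// implementation described in the paper.
CSR spgemm(const CSR& A, const CSR& B) {
    CSR C;
    C.nrows = A.nrows;
    C.ncols = B.ncols;
    C.row_ptr.assign(A.nrows + 1, 0);
    for (int i = 0; i < A.nrows; ++i) {
        std::unordered_map<int, double> acc;   // column index -> partial sum
        for (int ka = A.row_ptr[i]; ka < A.row_ptr[i + 1]; ++ka) {
            const int k = A.col[ka];
            const double a = A.val[ka];
            for (int kb = B.row_ptr[k]; kb < B.row_ptr[k + 1]; ++kb)
                acc[B.col[kb]] += a * B.val[kb];
        }
        for (const auto& [j, v] : acc) {       // emit row i (unsorted columns)
            C.col.push_back(j);
            C.val.push_back(v);
        }
        C.row_ptr[i + 1] = static_cast<int>(C.col.size());
    }
    return C;
}
```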
Layrub: layer-centric GPU memory reuse and data migration in extreme-scale deep learning systems
Bo Liu, Wenbin Jiang, Hai Jin, Xuanhua Shi, Yang Ma
DOI: 10.1145/3178487.3178528 · Published: 2018-02-10
Abstract: The growing accuracy and robustness of Deep Neural Network (DNN) models are accompanied by growing model capacity (going deeper or wider). However, the high memory requirements of these models make it difficult to execute the training process on a single GPU. To address this, we first identify the memory usage characteristics of deep and wide convolutional networks, and demonstrate opportunities for memory reuse at both the intra-layer and inter-layer levels. We then present Layrub, a runtime data placement strategy that orchestrates the execution of the training process. It achieves layer-centric reuse to reduce memory consumption for extreme-scale deep learning that cannot be run on a single GPU.
Citations: 2
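A minimal illustration of the inter-layer reuse idea (hypothetical helper, not Layrub's actual implementation): if activations of different layers are not live at the same time, they can be served from a small shared pool instead of each layer holding a private buffer, bounding peak memory by the number of simultaneously live activations rather than by the number of layers.

```cpp
#include <cstddef>
#include <vector>

// Tiny sketch of inter-layer buffer reuse: layers borrow a scratch buffer
// from a shared pool and return it once their activation is no longer live.
// Hypothetical helper for illustration only, not Layrub itself.
class BufferPool {
    std::vector<std::vector<float>> free_;
public:
    std::vector<float> acquire(std::size_t elems) {
        if (!free_.empty()) {
            std::vector<float> buf = std::move(free_.back());
            free_.pop_back();
            buf.resize(elems);
            return buf;                 // reuse a previously released buffer
        }
        return std::vector<float>(elems);
    }
    void release(std::vector<float> buf) {
        free_.push_back(std::move(buf));
    }
};
```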
High-performance genomic analysis framework with in-memory computing
Xueqi Li, Guangming Tan, Bingchen Wang, Ninghui Sun
DOI: 10.1145/3178487.3178511 · Published: 2018-02-10
Abstract: In this paper, we propose an in-memory computing framework (called GPF) that provides a set of genomic formats, APIs, and a fast genomic engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies in execution. We further present both system-level and algorithm-specific implementations that let users build genomic analysis pipelines without any knowledge of Spark parallel programming. To test the performance of GPF, we built a WGS pipeline on top of it as a test case. Our experimental data indicate that GPF completes whole-genome sequencing (WGS) analysis of the 146.9-gigabase Human Platinum Genome in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Together, our GPF framework provides a fast and general engine for large-scale genomic data processing with support for in-memory computing.
Citations: 6
Featherlight on-the-fly false-sharing detection
Milind Chabbi, Shasha Wen, Xu Liu
DOI: 10.1145/3178487.3178499 · Published: 2018-02-10
Abstract: Shared-memory parallel programs routinely suffer from false sharing: a performance degradation caused by different threads accessing different variables that reside on the same CPU cache line, where at least one of the variables is modified. State-of-the-art tools detect false sharing via a heavyweight process of logging memory accesses and feeding the ensuing access traces to an offline cache simulator. We have developed Feather, a lightweight, on-the-fly false-sharing detection tool. Feather achieves low overhead by exploiting two hardware features ubiquitous in commodity CPUs: performance monitoring units (PMUs) and debug registers. Additionally, Feather is a first-of-its-kind tool that detects false sharing in multi-process applications that use shared memory. Feather allowed us to scale false-sharing detection to myriad codes; it detected several false-sharing cases in important multi-core and multi-process codes, including previous PPoPP artifacts. Eliminating false sharing resulted in dramatic (up to 16x) speedups.
Citations: 18
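The pattern Feather detects is easy to reproduce: two threads updating adjacent counters that fall on the same 64-byte cache line slow each other down even though they never touch the same variable, and padding each counter to its own line removes the contention. The sketch below is a standalone illustration, independent of Feather itself.

```cpp
#include <atomic>
#include <thread>

// Two per-thread counters. In the first layout they share a 64-byte cache
// line (false sharing); in the padded layout each counter owns its own line.
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};              // same cache line as 'a'
};

struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};  // separate cache line
};

template <typename Counters>
void run(Counters& c, long iters) {
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}

int main() {
    SharedLine s;
    Padded p;
    run(s, 10'000'000);   // typically much slower: the cache line ping-pongs
    run(p, 10'000'000);   // padded version avoids the coherence traffic
}
```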
A persistent lock-free queue for non-volatile memory
Michal Friedman, M. Herlihy, Virendra J. Marathe, E. Petrank
DOI: 10.1145/3178487.3178490 · Published: 2018-02-10
Abstract: Non-volatile memory is expected to coexist with (or even displace) volatile DRAM as main memory in upcoming architectures. This has led to increasing interest in the problem of designing and specifying durable data structures that can recover from system crashes. Data structures may be designed to satisfy stricter or weaker durability guarantees, balancing the strength of the provided guarantees against performance overhead. This paper proposes three novel implementations of a concurrent lock-free queue. These implementations illustrate the algorithmic challenges in building persistent lock-free data structures with different levels of durability guarantees. In presenting these challenges, the proposed algorithmic designs, and the different durability guarantees, we hope to shed light on ways to build a wide variety of durable data structures. We implemented the various designs and compared their performance overhead to a simple queue design for standard (volatile) memory.
Citations: 110
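The central ordering discipline in a durable lock-free queue is that a node must be persisted before any pointer that makes it reachable, and a link must be persisted before the tail advances past it. The heavily simplified enqueue sketch below (x86 flush intrinsics; dequeue, recovery, and memory reclamation omitted; assumes the tail initially points at a dummy node) illustrates that discipline, not the paper's three algorithms.

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_clflush, _mm_sfence

// Flush a cache line toward the persistence domain and order it before
// later stores become visible.
static void persist(const void* p) {
    _mm_clflush(p);
    _mm_sfence();
}

struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
};

// Simplified durable enqueue in the Michael-Scott style: the node is persisted
// before it is linked, and the link is persisted before the tail is advanced,
// so a crash never exposes a reachable-but-unpersisted node. Sketch of the
// ordering discipline only, not the paper's full designs.
void enqueue(std::atomic<Node*>& tail, Node* node) {
    persist(node);                               // node contents durable first
    for (;;) {
        Node* last = tail.load();
        Node* next = last->next.load();
        if (next == nullptr) {
            if (last->next.compare_exchange_weak(next, node)) {
                persist(&last->next);            // link durable before tail moves
                tail.compare_exchange_strong(last, node);
                return;
            }
        } else {
            persist(&last->next);                // help: persist the pending link
            tail.compare_exchange_strong(last, next);
        }
    }
}
```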
Two concurrent data structures for efficient datalog query processing
Herbert Jordan, Bernhard Scholz, Pavle Subotic
DOI: 10.1145/3178487.3178525 · Published: 2018-02-10
Abstract: In recent years, Datalog has gained popularity for the implementation of advanced data analysis. Applications benefit from Datalog's high-level, declarative syntax and the availability of efficient algorithms for computing solutions. The efficiency of Datalog engines has reached a point where engines such as Soufflé have reported performance comparable to low-level hand-crafted alternatives [3].
Citations: 2