Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming最新文献

筛选
英文 中文
Corder 货物订单
Yuang Chen, Y. Chung
{"title":"Corder","authors":"Yuang Chen, Y. Chung","doi":"10.1145/3437801.3441606","DOIUrl":"https://doi.org/10.1145/3437801.3441606","url":null,"abstract":"The intrinsic irregular data structure of graphs often causes poor cache utilization thus deteriorates the performance of graph analytics. Prior works have designed a variety of graph reordering methods to improve cache efficiency. However, little insight has been provided into the issue of workload imbalance for multicore systems. In this work, we identify that a major factor affecting the performance is the unevenly distributed computation load amongst cores. To cope with this problem, we propose cache-aware reordering (Corder), a lightweight reordering algorithm that facilitates workload balance as well as cache optimization. Comprehensive performance evaluation of Corder is conducted on various graph applications and datasets. We observe that Corder yields speedup of up to 2.59× (on average 1.47×) over original graphs.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122810628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Advanced synchronization techniques for task-based runtime systems 基于任务的运行时系统的高级同步技术
David Álvarez, Kevin Sala, Marcos Maroñas, Aleix Roca, Vicencc Beltran
{"title":"Advanced synchronization techniques for task-based runtime systems","authors":"David Álvarez, Kevin Sala, Marcos Maroñas, Aleix Roca, Vicencc Beltran","doi":"10.1145/3437801.3441601","DOIUrl":"https://doi.org/10.1145/3437801.3441601","url":null,"abstract":"Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks remains a challenge, and bottlenecks can manifest in several runtime components. In this paper, we analyze the limiting factors in the scalability of a task-based runtime system and propose individual solutions for each of the challenges, including a wait-free dependency system and a novel scalable scheduler design based on delegation. We evaluate how the optimizations impact the overall performance of the runtime, both individually and in combination. We also compare the resulting runtime against state of the art OpenMP implementations, showing equivalent or better performance, especially for fine-grained tasks.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124212508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
On group mutual exclusion for dynamic systems 关于动态系统的群互斥问题
Shreyas Gokhale, Sahil Dhoked, N. Mittal
{"title":"On group mutual exclusion for dynamic systems","authors":"Shreyas Gokhale, Sahil Dhoked, N. Mittal","doi":"10.1145/3437801.3441608","DOIUrl":"https://doi.org/10.1145/3437801.3441608","url":null,"abstract":"The group mutual exclusion (GME) problem is a generalization of the classical mutual exclusion problem in which every critical section is associated with a type or session. Critical sections belonging to the same session can execute concurrently, whereas critical sections belonging to different sessions must be executed serially. The well-known read-write mutual exclusion problem is a special case of the group mutual exclusion problem. In this work, we present a suite of GME algorithms for an asynchronous shared-memory system under the cache coherent (CC) model. Our GME algorithms have the feature that they do not require an a priori knowledge of the set of processes in the system and all their important complexity measures depend only on the number of different sessions that can be requested by processes. This makes them especially suitable for a dynamic system in which the set of processes may change over time. To our knowledge, no existing GME algorithm satisfies this property. Our GME algorithms provide different trade offs between complexity and fairness.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124454107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Lightweight preemptive user-level threads 轻量级抢占式用户级线程
Shumpei Shiina, Shintaro Iwasaki, K. Taura, P. Balaji
{"title":"Lightweight preemptive user-level threads","authors":"Shumpei Shiina, Shintaro Iwasaki, K. Taura, P. Balaji","doi":"10.1145/3437801.3441610","DOIUrl":"https://doi.org/10.1145/3437801.3441610","url":null,"abstract":"Many-to-many mapping models for user- to kernel-level threads (or \"M:N threads\") have been extensively studied for decades as a lightweight substitute for current Pthreads implementations that provide a simple one-to-one mapping (\"1:1 threads\"). M:N threads derive performance from their ability to allow users to context switch between threads and control their scheduling entirely in user space with no kernel involvement. This same ability, however, causes M:N threads to lose the kernel-provided ability of implicit OS preemption---threads have to explicitly yield control for other threads to be scheduled. Hence, programs over nonpreemptive M:N threads can cause core starvation, loss of prioritization, and, sometimes, deadlock unless programs are written to explicitly yield in proper places. This paper explores two techniques for M:N threads to efficiently achieve implicit preemption similar to 1:1 threads: signal-yield and KLT-switching. Overheads of these techniques, with our optimizations, can be less than 1% compared with nonpreemptive M:N threads. Our evaluation with three applications demonstrates that our preemption techniques for M:N threads improve core utilization and enhance the performance by utilizing lightweight context switching and flexible scheduling of M:N threads.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128257916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Exploring deep reuse in winograd CNN inference winograd CNN推理中的深度重用探索
Ruofan Wu, Feng Zhang, Zhen Zheng, Xiaoyong Du, Xipeng Shen
{"title":"Exploring deep reuse in winograd CNN inference","authors":"Ruofan Wu, Feng Zhang, Zhen Zheng, Xiaoyong Du, Xipeng Shen","doi":"10.1145/3437801.3441588","DOIUrl":"https://doi.org/10.1145/3437801.3441588","url":null,"abstract":"Convolutional neural networks (CNNs), as representatives of deep learning, are one of the most commonly used neural networks in applications such as graphic image analysis. However, CNN has heavy computation patterns; network training processes could take several hours even with modern processors. Different from the training process, the inference process is more often executed on devices with low computing power, such as CPUs. Fortunately, a minimal filtering algorithm, Winograd, can reduce the convolution computations by reducing the number of multiplication operations. We find that the Winograd convolution can be further accelerated by reusing the similar data and computation patterns, which is called deep reuse.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130986990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
An efficient uncertain graph processing framework for heterogeneous architectures 一种高效的异构体系结构不确定图处理框架
Heng Zhang, Lingda Li, Donglin Zhuang, Rui Liu, Shuang Song, Dingwen Tao, Y. Wu, S. Song
{"title":"An efficient uncertain graph processing framework for heterogeneous architectures","authors":"Heng Zhang, Lingda Li, Donglin Zhuang, Rui Liu, Shuang Song, Dingwen Tao, Y. Wu, S. Song","doi":"10.1145/3437801.3441584","DOIUrl":"https://doi.org/10.1145/3437801.3441584","url":null,"abstract":"Uncertain or probabilistic graphs have been ubiquitously used in many emerging applications. Previously CPU based techniques were proposed to use sampling but suffer from (1) low computation efficiency and large memory overhead, (2) low degree of parallelism, and (3) nonexistent general framework to effectively support programming uncertain graph applications. To tackle these challenges, we propose a general uncertain graph processing framework for multi-GPU systems, named BPGraph. Integrated with our highly-efficient path sampling method, BPGraph can support a wide range of uncertain graph algorithms' development and optimization. Extensive evaluation demonstrates a significant performance improvement from BPGraph over the state-of-the-art uncertain graph sampling techniques.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125160746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
ShadowVM
Zhifang Li, Mingcong Han, Shangwei Wu, Chuliang Weng
{"title":"ShadowVM","authors":"Zhifang Li, Mingcong Han, Shangwei Wu, Chuliang Weng","doi":"10.1145/3437801.3441595","DOIUrl":"https://doi.org/10.1145/3437801.3441595","url":null,"abstract":"With the development of the big data ecosystem, large-scale data analytics has become more prevalent in the past few years. Apache Spark, etc., provide a flexible approach for scalable processing upon massive data. However, they are not designed for handling computing-intensive workloads due to the restrictions of JVM runtime. In contrast, GPU has been the de facto accelerator for graphics rendering and deep learning in recent years. Nevertheless, the current architecture makes it difficult to take advantage of GPUs and other accelerators in the big data world. Now, it is time to break down this obstacle by changing the fundamental architecture. To integrate accelerators efficiently, we decouple the control plane and the data plane within big data systems via action shadowing. The control plane keeps logic information to fit well with the host systems like Spark, while the data plane holds data and performs execution upon bare metal CPUs and GPUs. Under this decoupled architecture, both the control plane and the data plane could leverage the appropriate approaches without breaking existing mechanisms. Based on this idea, we implement an accelerated data plane, namely ShadowVM. In our experiments on the SSB benchmark, ShadowVM lifts the JVM-based Spark with up to 14.7× speedup. Furthermore, ShadowVM could also outperform the GPU-only fashion by adopting mixed CPU-GPU execution.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115092976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Scaling implicit parallelism via dynamic control replication 通过动态控制复制扩展隐式并行性
Michael A. Bauer, Wonchan Lee, Elliott Slaughter, Zhihao Jia, M. D. Renzo, Manolis Papadakis, G. Shipman, P. McCormick, M. Garland, A. Aiken
{"title":"Scaling implicit parallelism via dynamic control replication","authors":"Michael A. Bauer, Wonchan Lee, Elliott Slaughter, Zhihao Jia, M. D. Renzo, Manolis Papadakis, G. Shipman, P. McCormick, M. Garland, A. Aiken","doi":"10.1145/3437801.3441587","DOIUrl":"https://doi.org/10.1145/3437801.3441587","url":null,"abstract":"We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication distributes dependence analysis by executing multiple copies of an implicitly parallel program while ensuring that they still collectively behave as a single execution. By distributing and parallelizing the dependence analysis, dynamic control replication supports efficient, on-the-fly computation of dependences for programs with arbitrary control flow at scale. We describe an asymptotically scalable algorithm for implementing dynamic control replication that maintains the sequential semantics of implicitly parallel programs. An implementation of dynamic control replication in the Legion runtime delivers the same programmer productivity as writing in other implicitly parallel programming models, such as Dask or TensorFlow, while providing better performance (11.4X and 14.9X respectively in our experiments), and scalability to hundreds of nodes. We also show that dynamic control replication provides good absolute performance and scaling for HPC applications, competitive in many cases with explicitly parallel programming systems.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128019637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Simplifying low-level GPU programming with GAS 用GAS简化低级GPU编程
D. Yan, Wei Wang, X. Chu
{"title":"Simplifying low-level GPU programming with GAS","authors":"D. Yan, Wei Wang, X. Chu","doi":"10.1145/3437801.3441591","DOIUrl":"https://doi.org/10.1145/3437801.3441591","url":null,"abstract":"Many low-level optimizations for NVIDIA GPU can only be implemented in native hardware assembly (SASS). However, programming in SASS is unproductive and not portable. To simplify low-level GPU programming, we present GAS (Gpu ASsembly), a PTX-like language that provides a stable instruction set across hardware architectures while giving programmers a low-level control of code execution. We demonstrate that GAS can be used with ease for low-level benchmarking and performance tuning in the context of Tensor Core HGEMM.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117204121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
ApproxTuner ApproxTuner
Hashim Sharif, Yifan Zhao, Maria Kotsifakou, Akash Kothari, Ben Schreiber, Elizabeth Wang, Yasmin Sarita, Nathan Zhao, Keyur Joshi, Vikram S. Adve, Sasa Misailovic, S. Adve
{"title":"ApproxTuner","authors":"Hashim Sharif, Yifan Zhao, Maria Kotsifakou, Akash Kothari, Ben Schreiber, Elizabeth Wang, Yasmin Sarita, Nathan Zhao, Keyur Joshi, Vikram S. Adve, Sasa Misailovic, S. Adve","doi":"10.1145/3437801.3446108","DOIUrl":"https://doi.org/10.1145/3437801.3446108","url":null,"abstract":"Manually optimizing the tradeoffs between accuracy, performance and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware optimization of tensor-based applications while requiring only high-level end-to-end quality specifications. ApproxTuner implements and manages approximations in algorithms, system software, and hardware. The key contribution in ApproxTuner is a novel three-phase approach to approximation-tuning that consists of development-time, install-time, and run-time phases. Our approach decouples tuning of hardware-independent and hardware-specific approximations, thus providing retargetability across devices. To enable efficient autotuning of approximation choices, we present a novel accuracy-aware tuning technique called predictive approximation-tuning, which significantly speeds up autotuning by analytically predicting the accuracy impacts of approximations. We evaluate ApproxTuner across 10 convolutional neural networks (CNNs) and a combined CNN and image processing benchmark. For the evaluated CNNs, using only hardware-independent approximation choices we achieve a mean speedup of 2.1x (max 2.7x) on a GPU, and 1.3x mean speedup (max 1.9x) on the CPU, while staying within 1 percentage point of inference accuracy loss. For two different accuracy-prediction models, ApproxTuner speeds up tuning by 12.8x and 20.4x compared to conventional empirical tuning while achieving comparable benefits.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114732193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信