2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)最新文献

Student research poster: Compiling Boolean circuits to non-deterministic branching programs to be implemented by light switching circuits 学生研究海报:编译布尔电路到由光开关电路实现的非确定性分支程序

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2971468

V. Tartakovsky

引用次数: 0

Scheduling techniques for GPU architectures with processing-in-memory capabilities 具有内存处理能力的GPU架构的调度技术

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967940

Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, M. Kandemir, O. Mutlu, C. Das

{"title":"Scheduling techniques for GPU architectures with processing-in-memory capabilities","authors":"Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, M. Kandemir, O. Mutlu, C. Das","doi":"10.1145/2967938.2967940","DOIUrl":"https://doi.org/10.1145/2967938.2967940","url":null,"abstract":"Processing data in or near memory (PIM), as opposed to in conventional computational units in a processor, can greatly alleviate the performance and energy penalties of data transfers from/to main memory. Graphics Processing Unit (GPU) architectures and applications, where main memory bandwidth is a critical bottleneck, can benefit from the use of PIM. To this end, an application should be properly partitioned and scheduled to execute on either the main, powerful GPU cores that are far away from memory or the auxiliary, simple GPU cores that are close to memory (e.g., in the logic layer of 3D-stacked DRAM). This paper investigates two key code scheduling issues in such a GPU architecture that has PIM capabilities, to maximize performance and energy-efficiency: (1) how to automatically identify the code segments, or kernels, to be offloaded to the cores in memory, and (2) how to concurrently schedule multiple kernels on the main GPU cores and the auxiliary GPU cores in memory. We develop two new runtime techniques: (1) a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and (2) a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on main GPU cores and the GPU cores in memory. Our experimental evaluations across 25 GPU applications demonstrate that these two techniques can significantly improve both application performance (by 25% and 42%, respectively, on average) and energy efficiency (by 28% and 27%).","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126642280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 175

OAWS: Memory Occlusion Aware Warp Scheduling OAWS:内存闭塞感知Warp调度

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967947

Bin Wang, Yue Zhu, Weikuan Yu

引用次数: 24

Online scalability characterization of data-parallel programs on many cores 多核数据并行程序的在线可伸缩性特性

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967960

Younghyun Cho, S. Oh, Bernhard Egger

引用次数: 9

A static cut-off for task parallel programs 任务并行程序的静态截止

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967968

Shintaro Iwasaki, K. Taura

引用次数: 18

Combating the reliability challenge of GPU register file at low supply voltage 解决低电源电压下GPU寄存器文件可靠性问题

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967951

Jingweijia Tan, S. Song, Kaige Yan, Xin Fu, A. Márquez, D. Kerbyson

{"title":"Combating the reliability challenge of GPU register file at low supply voltage","authors":"Jingweijia Tan, S. Song, Kaige Yan, Xin Fu, A. Márquez, D. Kerbyson","doi":"10.1145/2967938.2967951","DOIUrl":"https://doi.org/10.1145/2967938.2967951","url":null,"abstract":"Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (Vmin) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. To better understand the reliability issues posed by undervolting and its energy-saving potential, we first rigorously model and analyze the process variation impact on the GPU register file at different voltages. By further analyzing the GPU architecture, we make a key observation that the time GPU registers contain useless data (i.e., dead time) is long, providing a unique opportunity to enhance register reliability. We then propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operations from unreliable register file at low voltages. GR-Guard is both effective and low-cost, and does not affect normal (i.e., non-faulty) register accesses. Experimental results show that for a 28nm baseline GPU under aggressive voltage reduction, GR-Guard can maintain the register file reliability with less than 2% overall performance degradation, while achieving an average of 31% energy reduction across various applications.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124226941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 22

POSTER: Exploiting asymmetric multi-core processors with flexible system software POSTER:利用非对称多核处理器和灵活的系统软件

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2976038

Kallia Chronaki, Miquel Moretó, Marc Casas, Alejandro Rico, Rosa M. Badia, E. Ayguadé, Jesús Labarta, M. Valero

引用次数: 2

Auto-tuning Spark big data workloads on POWER8: Prediction-based dynamic SMT threading POWER8上的Spark大数据工作负载自动调优:基于预测的动态SMT线程

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967957

Zhen Jia, Chao Xue, Guancheng Chen, Jianfeng Zhan, Lixin Zhang, Yonghua Lin, H. P. Hofstee

{"title":"Auto-tuning Spark big data workloads on POWER8: Prediction-based dynamic SMT threading","authors":"Zhen Jia, Chao Xue, Guancheng Chen, Jianfeng Zhan, Lixin Zhang, Yonghua Lin, H. P. Hofstee","doi":"10.1145/2967938.2967957","DOIUrl":"https://doi.org/10.1145/2967938.2967957","url":null,"abstract":"Much research work devotes to tuning big data analytics in modern data centers, since even a small percentage of performance improvement immediately translates to huge cost savings because of the large scale. Simultaneous multithreading (SMT) receives great interest from data center communities, as it has the potential to boost performance of big data analytics by increasing the processor resources utilization. For example, the emerging processor architectures like POWER8 support up to 8-way multithreading. However, as different big data workloads have disparate architectural characteristics, how to identify the most efficient SMT configuration to achieve the best performance is challenging in terms of both complex application behaviors and processor architectures. In this paper, we specifically focus on auto-tuning SMT configuration for Spark-based big data workloads on POWER8. However, our methodology could be generalized and extended to other programming software stacks and other architectures. We propose a prediction-based dynamic SMT threading (PBDST) framework to adjust the thread count in SMT cores on POWER8 processors by using versatile machine learning algorithms. Its innovation lies in adopting online SMT configuration predictions derived from microarchitecture level profiling, to regulate the thread counts that could achieve nearly optimal performance. Moreover it is implemented at Spark software stack layer and transparent to user applications. After evaluating a large set of machine learning algorithms, we choose the most efficient ones to perform online predictions. The experimental results demonstrate that our approach can achieve up to 56.3% performance improvement and an average performance gain of 16.2% in comparison with the default configuration-the maximum SMT configuration-SMT8 on our system.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133608905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 25

POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies POSTER: ξ-TAO:一个以缓存为中心的执行模型和运行时，用于深度并行多核拓扑

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2974052

M. Pericàs

{"title":"POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies","authors":"M. Pericàs","doi":"10.1145/2967938.2974052","DOIUrl":"https://doi.org/10.1145/2967938.2974052","url":null,"abstract":"We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124367952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Characterizing and optimizing the performance of multithreaded programs under interference 干扰下多线程程序性能的表征与优化

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967939

Yong Zhao, J. Rao, Qing Yi

{"title":"Characterizing and optimizing the performance of multithreaded programs under interference","authors":"Yong Zhao, J. Rao, Qing Yi","doi":"10.1145/2967938.2967939","DOIUrl":"https://doi.org/10.1145/2967938.2967939","url":null,"abstract":"As virtualization becomes ubiquitous in datacenters, there is a growing interest in characterizing application performance in multi-tenant environments to improve datacenter resource management. The performance of parallel programs is notoriously difficult to reason about in virtualized environments. Although performance degradations caused by virtualization and interferences have been extensively studied, there still lacks a comprehensive understanding why parallel programs have unpredictable slowdowns when co-located with different types of workloads. This paper presents a systematic and quantitative study of multithreaded performance under interference. We design synthetic workloads to emulate different types of interference and study the behavior of parallel programs under such interferences. We find that unpredictable performance is the result of complex interplays between the design of the program, the memory hierarchy of the host system, and the CPU scheduling at the hypervisor. To understand the intricate relationships between multiple factors, we decompose parallel runtime into compute, synchronization and steal time, and use the runtime breakdown to measure program progress and identify execution inefficiency under interference. Based on these findings, we develop an online approach to predicting performance slowdown without requiring parallel programs to be completed, and devise two scheduling optimizations at the hypervisor to reduce slowdowns. Experimental results with Xen and representative parallel workloads show that the online performance prediction achieves on average less than 4.5% error and the optimizations reduce runtime slowdown by as much as 38% compared to stock Xen.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124037247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13