2016 International Conference on Parallel Architecture and Compilation Techniques (PACT): Latest Papers

Hash Map Inlining

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967949

Dibakar Gope, Mikko H. Lipasti

Abstract: Scripting languages like JavaScript and PHP are widely used to implement application logic for dynamically generated web pages. Their popularity is due in large part to their flexible syntax and dynamic type system, which enable rapid turnaround time for prototyping, releasing, and updating web site features and capabilities. The most common complex data structure in these languages is the hash map, which is used to store key-value pairs. In many cases, hash maps with a fixed set of keys are used in lieu of explicitly defined classes or structures, as would be common in compiled languages like Java or C++. Unfortunately, the runtime overhead of key lookup and value retrieval is quite high, especially relative to the direct offsets that compiled languages can use to access class members. Furthermore, key lookup and value retrieval incur high microarchitectural costs as well, since the paths they execute contain unpredictable branches and many cache accesses, leading to substantially higher numbers of branch mispredicts and cache misses per access to the hash map. This paper quantifies these overheads, describes a compiler algorithm that discovers common use cases for hash maps and inlines them so that keys are accessed with direct offsets, and reports measured performance benefits on real hardware. A prototype implementation in the HipHop VM infrastructure shows promising performance benefits for a broad array of hash map-intensive server-side PHP applications, up to 37.6% and averaging 18.81%, and improves SPECWeb throughput by 7.71% (banking) and 11.71% (e-commerce).
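The contrast the abstract draws between key lookup and direct offsets can be illustrated with a minimal Python sketch. It is not the paper's compiler algorithm and does not reflect HipHop VM internals; the names `KEYS`, `make_layout`, and `to_slots` are hypothetical, and the key set is assumed to be fixed and known ahead of time.

```python
# Illustrative sketch only: not the paper's algorithm or HipHop VM internals.
# All names here (KEYS, make_layout, to_slots) are hypothetical.

KEYS = ("id", "name", "balance")  # assumed fixed key set, known ahead of time


def make_layout(keys):
    """Assign each key a fixed slot index, computed once."""
    return {key: offset for offset, key in enumerate(keys)}


def to_slots(record, layout):
    """Copy a dict with the fixed key set into a flat list of slots."""
    slots = [None] * len(layout)
    for key, value in record.items():
        slots[layout[key]] = value
    return slots


layout = make_layout(KEYS)
account = {"id": 7, "name": "alice", "balance": 120.5}

# Generic path: every access hashes the key and probes the table.
via_hash_map = account["balance"]

# "Inlined" path: the offset is resolved once, then each access is an index.
BALANCE_OFFSET = layout["balance"]
slots = to_slots(account, layout)
via_offset = slots[BALANCE_OFFSET]

print(via_hash_map, via_offset)
```

In Python both paths still go through the interpreter, so this shows only the shape of the transformation: the key-to-offset mapping is resolved once, and later accesses need no hashing or probing.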

Citations: 4

POSTER - collective dynamic parallelism for directive based GPU programming languages and compilers

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2974056

Guray Ozen, E. Ayguadé, Jesús Labarta

Citations: 2

μC-States: Fine-grained GPU datapath power management

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967941

Onur Kayiran, Adwait Jog, Ashutosh Pattnaik, Rachata Ausavarungnirun, Xulong Tang, M. Kandemir, G. Loh, O. Mutlu, C. Das

Abstract: To improve the performance of Graphics Processing Units (GPUs) beyond simply increasing core count, architects have recently been adopting a scale-up approach: the peak throughput and individual capabilities of the GPU cores are increasing rapidly. This big-core trend in GPUs leads to various challenges, including higher static power consumption and lower and imbalanced utilization of the datapath components of a big core. As we show in this paper, two key problems ensue: (1) the lower and imbalanced datapath utilization can waste power, as an application does not always utilize all portions of the big-core datapath, and (2) the use of big cores can lead to application performance degradation in some cases due to the higher memory system contention caused by the larger number of memory requests generated by each big core. This paper introduces a new analysis of datapath component utilization in big-core GPUs based on queuing theory principles. Building on this analysis, we introduce a fine-grained dynamic power- and clock-gating mechanism for the entire datapath, called μC-States, which aims to minimize power consumption by turning off or tuning down datapath components that are not bottlenecks for the performance of the running application. Our experimental evaluation demonstrates that μC-States significantly reduces both static and dynamic power consumption in a big-core GPU, while also significantly improving the performance of applications affected by high memory system contention. We also show that our analysis of datapath component utilization can guide scheduling and design decisions in a GPU architecture that contains heterogeneous cores.

Citations: 40

Integrating algorithmic parameters into benchmarking and design space exploration in 3D scene understanding

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967963

Sreekar Shenoy, Bruno Bodin, Luigi Nardi, M. Zia, Harry Wagstaff, Govind Sreekar Shenoy, M. Emani, John Mawer, Christos Kotselidis, A. Nisbet, M. Luján, Björn Franke, P. Kelly, M. O’Boyle

Abstract: System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers. We design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application could only run in real time on desktop GPUs. In this work, we examine how it can be mapped to power-constrained embedded systems. Key to our approach is the idea of incremental co-design exploration, where optimization choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices. The goal of this exploration is to reduce execution time while minimizing power and meeting our quality-of-result objective. As the design space is too large to evaluate exhaustively, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. This is a 4.8× execution time improvement and a 2.8× power reduction compared to the state of the art.

Citations: 45

POSTER: An integrated vector-scalar design on an in-order ARM core

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2974057

Milan Stanic, Oscar Palomar, Timothy Hayes, Ivan Ratković, O. Unsal, A. Cristal, M. Valero

Citations: 0

Sparso: Context-driven optimizations of sparse linear algebra

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967943

Hongbo Rong, Jongsoo Park, Lingxiang Xiang, T. A. Anderson, M. Smelyanskiy

Abstract: The sparse matrix is a key data structure in various domains such as high-performance computing, machine learning, and graph analytics. To maximize performance of sparse matrix operations, it is especially important to optimize across the operations and not just within individual operations. While a straightforward per-operation mapping to library routines misses optimization opportunities, manually optimizing across the boundary of library routines is time-consuming and error-prone, sacrificing productivity. This paper introduces Sparso, a framework that automates such optimizations, enabling both high performance and high productivity. In Sparso, a compiler and sparse linear algebra libraries collaboratively discover and exploit context, which we define as the invariant properties of matrices and relationships between them in a program. We present compiler analyses, namely collective reordering analysis and matrix property discovery, to discover the context. The context discovered from these analyses drives key optimizations across library routines and matrices. We have implemented Sparso with the Julia language and the Intel MKL and SpMP libraries. We evaluate our context-driven optimizations in 6 representative sparse linear algebra algorithms. Compared with a baseline that invokes high-performance libraries without context optimizations, Sparso achieves speedups of 1.2x to 17x (average 5.7x). Our approach of compiler-library collaboration and context-driven optimizations should also be applicable to other productivity languages such as Matlab, Python, and R.
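As a rough illustration of what exploiting an invariant matrix property across operations could look like, the following Python sketch discovers a property once before a loop and reuses it inside. It is not Sparso's analysis and does not use the MKL or SpMP APIs; symmetry is chosen here only as an assumed example of such a property, the matrix is assumed to be square and unmodified during the loop, and all function names are hypothetical.

```python
# Illustrative sketch only: not Sparso's analyses or the MKL/SpMP APIs.
# Symmetry is used as one assumed example of an invariant matrix property.
# The matrix is assumed to be square and stored in CSR form.

def spmv(indptr, indices, data, x):
    """y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def spmv_transposed(indptr, indices, data, x, ncols):
    """y = A.T @ x, computed by scattering along CSR rows."""
    y = [0.0] * ncols
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            y[indices[k]] += data[k] * x[row]
    return y


def is_symmetric(indptr, indices, data):
    """Discover the property once; valid while the matrix is not modified."""
    entries = {}
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            entries[(row, indices[k])] = data[k]
    return all(entries.get((c, r)) == v for (r, c), v in entries.items())


def transposed_products(indptr, indices, data, vectors):
    """Compute A.T @ x for many x, reusing the property found before the loop."""
    n = len(indptr) - 1
    symmetric = is_symmetric(indptr, indices, data)  # hoisted out of the loop
    results = []
    for x in vectors:
        if symmetric:
            results.append(spmv(indptr, indices, data, x))  # A.T == A
        else:
            results.append(spmv_transposed(indptr, indices, data, x, n))
    return results
```

A per-operation library call cannot know that the matrix stays unchanged across iterations; the point of the sketch is that knowledge gathered once outside the loop selects the cheaper routine inside it.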

Citations: 22

Resource conscious reuse-driven tiling for GPUs

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967967

P. Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, L. Pouchet, A. Rountev, P. Sadayappan

Abstract: Computations involving successive application of 3D stencil operators are widely used in many application domains, such as image processing, computational electromagnetics, seismic processing, and climate modeling. Enhancement of temporal and spatial locality via tiling is generally required in order to overcome performance bottlenecks due to limited bandwidth to global memory on GPUs. However, the low shared memory capacity on current GPU architectures makes effective tiling for 3D stencils very challenging: several previous domain-specific compilers for stencils have demonstrated very high performance for 2D stencils, but much lower performance on 3D stencils. In this paper, we develop an effective resource-constraint-driven approach for automated GPU code generation for stencils. We present a fusion technique that judiciously fuses stencil computations to minimize data movement, while controlling computational redundancy and maximizing resource usage. The fusion model subsumes time tiling of iterated stencils, and can be easily adapted to different GPU architectures. We integrate the fusion model into a code generator that makes effective use of scarce shared memory and registers to achieve high performance. The effectiveness of the automated model-driven code generator is demonstrated through experimental results on a number of benchmarks, comparing against various previously developed GPU code generators.
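The trade-off between data movement and redundant computation mentioned in the abstract can be seen in a small overlapped-tiling sketch. The Python code below fuses two applications of a 1D 3-point stencil per tile; it is a CPU-side illustration under simplifying assumptions (1D data, a sum stencil, end points copied unchanged), not the paper's fusion model or GPU code generator, and the function names are hypothetical.

```python
# Illustrative sketch only: overlapped tiling that fuses two applications of
# a 1D 3-point stencil. This is not the paper's fusion model or code generator.

def step(a):
    """One stencil application; the two end points are copied unchanged."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = a[i - 1] + a[i] + a[i + 1]
    return out


def two_steps_unfused(a):
    """Reference: each step streams the whole array through 'global memory'."""
    return step(step(a))


def two_steps_fused(a, tile=4, halo=2):
    """Apply both steps per tile on a local buffer (stand-in for shared memory).

    Each tile loads `halo` extra points per side (one per fused step) and
    recomputes them redundantly instead of exchanging intermediate results.
    """
    n = len(a)
    out = [0] * n
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        start, stop = max(0, lo - halo), min(n, hi + halo)
        local = step(step(a[start:stop]))  # both steps stay in the local buffer
        out[lo:hi] = local[lo - start:hi - start]
    return out


grid = list(range(10))
print(two_steps_fused(grid) == two_steps_unfused(grid))
```

Fusing more steps would need a wider halo, which increases both the local buffer size and the redundant work per tile; that tension is what makes the choice of fusion depth resource-dependent.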

Citations: 32

Power tuning HPC jobs on power-constrained systems

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2967961

Neha Gholkar, F. Mueller, B. Rountree

Abstract: As we approach the exascale era, power has become a primary bottleneck. The US Department of Energy has set a power constraint of 20MW on each exascale machine. To achieve one exaflop under this constraint, it is necessary that we use power intelligently to maximize performance. Most production-level parallel applications that run on a supercomputer are tightly coupled parallel applications. A naïve approach to enforcing a power constraint for a parallel job would be to divide the job's power budget uniformly across all the processors. However, previous work has shown that a power-capped job suffers from performance variation across otherwise identical processors, leading to overall sub-optimal performance. We propose a two-level hierarchical variation-aware approach to managing power at the machine level. At the macro level, PPartition partitions a machine's power budget across jobs, assigning a power budget to each job running on the system such that the machine never exceeds its power budget. At the micro level, PTune makes job-centric decisions by taking the performance variation into account. For every moldable job, PTune determines the optimal number of processors, the selection of processors and the distribution of the job's power budget across them, with the goal of maximizing the job's performance under its power budget. Experiments show that, at the micro level, PTune achieves a performance improvement of up to 29% compared to a naïve approach. PTune does not lead to any performance degradation, yet frees up almost 40% of the processors for the same performance as that of the naïve approach under a hard power bound. At the macro level, PPartition is able to achieve a throughput improvement of 5-35% compared to uniform power distribution.
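Why a uniform split can be sub-optimal for a tightly coupled job is easy to see in a toy model. The Python sketch below is not PTune or PPartition: it assumes that frequency grows linearly with power, that each processor needs a different number of watts per GHz because of manufacturing variation, and that the job runs at the pace of its slowest processor. The numbers and function names are hypothetical, and processor selection and processor count are not modeled.

```python
# Toy model only: not PTune or PPartition. It assumes frequency grows linearly
# with power and that each processor needs a different (hypothetical) number
# of watts per GHz because of manufacturing variation.

def job_speed(power_per_proc, watts_per_ghz):
    """A tightly coupled job runs at the pace of its slowest processor."""
    return min(p / w for p, w in zip(power_per_proc, watts_per_ghz))


def uniform_split(budget, watts_per_ghz):
    """Naive approach: every processor gets the same share of the budget."""
    share = budget / len(watts_per_ghz)
    return [share] * len(watts_per_ghz)


def variation_aware_split(budget, watts_per_ghz):
    """Give less efficient processors more power so all reach one frequency."""
    common_freq = budget / sum(watts_per_ghz)
    return [common_freq * w for w in watts_per_ghz]


costs = [20, 25, 50]  # hypothetical watts per GHz for three processors
budget = 190          # hypothetical job power budget in watts

uniform = uniform_split(budget, costs)
aware = variation_aware_split(budget, costs)
print(job_speed(uniform, costs), job_speed(aware, costs))
```

In this model the uniform split leaves the two more efficient processors with power they cannot turn into job progress, because the least efficient processor sets the pace; equalizing frequencies spends the same budget with no such slack.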

Citations: 75

POSTER - Firestorm: Operating systems for power-constrained architectures

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2975607

S. Panneerselvam, M. Swift

Abstract: Moore's law paved the way for doubling the transistors in the same chip area with every generation. However, with the end of Dennard scaling, voltage and hence the power draw of transistors are no longer dropping proportionally to size. As a result, modern processors cannot use all parts of the processor simultaneously without exceeding the power limit. This manifests as an increasing proportion of dark silicon [4]. In other words, the compute capacity of current and future processors is and will be over-provisioned with respect to the available power. Power limits are influenced by different factors such as the capacity of power distribution infrastructure, battery supply limits, and the thermal capacity of the system. Power limits in datacenters can arise from underprovisioning power distribution units relative to peak power draw. Energy limits are also dictated by the limited capacity of batteries. However, in many systems, the primary limit comes not from the ability to acquire power, but instead from the ability to dissipate power as heat once it has been used. Thermal limits are dictated by the physical properties of the processor materials and also by the comfort of the user. Thus, power is limited to prevent processor chips from overheating, which can lead to thermal breakdown. As a result, the maximum performance of a system is limited by its cooling capacity, which determines its ability to dissipate heat. Cooling capacity varies across the computing landscape, from servers with external chilled air to desktops with large fans to laptops to fan-less mobile devices.

Citations: 3

Student research poster: A scalable general purpose system for large-scale graph processing

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI: 10.1145/2967938.2971465

Jiawen Sun

Citations: 2