2016 International Conference on Parallel Architecture and Compilation Techniques (PACT): Latest Papers

Hash Map Inlining
Dibakar Gope, Mikko H. Lipasti
{"title":"Hash Map Inlining","authors":"Dibakar Gope, Mikko H. Lipasti","doi":"10.1145/2967938.2967949","DOIUrl":"https://doi.org/10.1145/2967938.2967949","url":null,"abstract":"Scripting languages like JavaScript and PHP are widely used to implement application logic for dynamically-generated web pages. Their popularity is due in large part to their flexible syntax and dynamic type system, which enable rapid turnaround time for prototyping, releasing, and updating web site features and capabilities. The most common complex data structure in these languages is the hash map, which is used to store key-value pairs. In many cases, hash maps with a fixed set of keys are used in lieu of explicitly defined classes or structures, as would be common in compiled languages like Java or C++. Unfortunately, the runtime overhead of key lookup and value retrieval is quite high, especially relative to the direct offsets that compiled languages can use to access class members. Furthermore, key lookup and value retrieval incur high microarchitectural costs as well, since the paths they execute contain unpredictable branches and many cache accesses, leading to substantially higher numbers of branch mispredicts and cache misses per access to the hash map. This paper quantifies these overheads, describes a compiler algorithm that discovers common use cases for hash maps and inlines them so that keys are accessed with direct offsets, and reports measured performance benefits on real hardware. 
A prototype implementation in the HipHop VM infrastructure shows promising performance benefits for a broad array of hash map-intensive server-side PHP applications, up to 37.6% and averaging 18.81%, and improves SPECWeb throughput by 7.71% (banking) and 11.71% (e-commerce).","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115631684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
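The Hash Map Inlining abstract above contrasts hashed key lookup with the direct offsets that compiled languages use for class members. The following is a minimal Python sketch of that contrast only, not the paper's HipHop VM compiler algorithm; the names `Shape`, `InlinedMap`, `total_generic`, and `total_inlined` are hypothetical and introduced purely for illustration.

```python
from typing import Any, Dict, List, Tuple


class Shape:
    """Hypothetical layout descriptor: maps each key of a fixed key set to a slot offset."""

    def __init__(self, keys: Tuple[str, ...]) -> None:
        self.keys = keys
        self.offsets: Dict[str, int] = {k: i for i, k in enumerate(keys)}


class InlinedMap:
    """Values live in a flat list; a pre-resolved offset replaces the key lookup."""

    def __init__(self, shape: Shape, values: List[Any]) -> None:
        assert len(values) == len(shape.keys)
        self.shape = shape
        self.slots = list(values)

    def get_at(self, offset: int) -> Any:
        # Direct indexed access: no hashing and no probing on this path.
        return self.slots[offset]


def total_generic(rows: List[Dict[str, float]]) -> float:
    # Baseline: every access hashes the key and probes the table.
    return sum(r["price"] * r["qty"] for r in rows)


def total_inlined(shape: Shape, rows: List[InlinedMap]) -> float:
    # Offsets are resolved once, outside the loop (assumption: a compiler
    # would do this ahead of time when the key set is known to be fixed).
    price, qty = shape.offsets["price"], shape.offsets["qty"]
    return sum(r.get_at(price) * r.get_at(qty) for r in rows)
```

In this sketch the per-access cost moves from a key lookup to a list index, which is the effect the abstract attributes to inlining; how the real system discovers fixed key sets is not shown here.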
POSTER - collective dynamic parallelism for directive based GPU programming languages and compilers
Guray Ozen, E. Ayguadé, Jesús Labarta
{"title":"POSTER - collective dynamic parallelism for directive based GPU programming languages and compilers","authors":"Guray Ozen, E. Ayguadé, Jesús Labarta","doi":"10.1145/2967938.2974056","DOIUrl":"https://doi.org/10.1145/2967938.2974056","url":null,"abstract":"Early programs for GPU (Graphics Processing Unit) acceleration were based on a flat, bulk parallel programming model, in which programs had to perform a sequence of kernel launches from the host CPU. In the latest releases of these devices, dynamic (or nested) parallelism is supported, making it possible to launch kernels from threads running on the device, without host intervention. Unfortunately, the overhead of launching kernels from the device is higher compared to launching from the host CPU, making the exploitation of dynamic parallelism unprofitable. This paper proposes and evaluates the basic idea behind a user-directed code transformation technique, named collective dynamic parallelism, that targets the effective exploitation of nested parallelism in modern GPUs. The technique dynamically packs dynamic parallelism kernel invocations and postpones their execution until a bunch of them are available. We show that for sparse matrix vector multiplication, CollectiveDP outperforms well-optimized libraries, making GPUs useful when matrices are highly irregular.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129793370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
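The collective dynamic parallelism abstract above describes packing nested kernel invocations and postponing them until several are available. The following is a host-side Python analogy of that batching idea, not GPU code and not the authors' transformation; `LaunchBatcher`, its `threshold` parameter, and the `(row, work)` request shape are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Request = Tuple[int, int]  # (row index, amount of work); hypothetical shape


class LaunchBatcher:
    """Collects launch requests and issues them as one batch once enough have queued."""

    def __init__(self, threshold: int, launch: Callable[[List[Request]], None]) -> None:
        self.threshold = threshold
        self.launch = launch
        self.pending: List[Request] = []

    def request(self, row: int, work: int) -> None:
        # Instead of launching immediately, defer the request.
        self.pending.append((row, work))
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        # Issue every deferred request as a single launch, amortizing launch overhead.
        if self.pending:
            batch, self.pending = self.pending, []
            self.launch(batch)
```

A caller would invoke `flush()` once at the end so that a final partial batch is not lost; the right threshold on real hardware is not something this sketch can suggest.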
μC-States: Fine-grained GPU datapath power management
Onur Kayiran, Adwait Jog, Ashutosh Pattnaik, Rachata Ausavarungnirun, Xulong Tang, M. Kandemir, G. Loh, O. Mutlu, C. Das
{"title":"μC-States: Fine-grained GPU datapath power management","authors":"Onur Kayiran, Adwait Jog, Ashutosh Pattnaik, Rachata Ausavarungnirun, Xulong Tang, M. Kandemir, G. Loh, O. Mutlu, C. Das","doi":"10.1145/2967938.2967941","DOIUrl":"https://doi.org/10.1145/2967938.2967941","url":null,"abstract":"To improve the performance of Graphics Processing Units (GPUs) beyond simply increasing core count, architects have recently adopted a scale-up approach: the peak throughput and individual capabilities of the GPU cores are increasing rapidly. This big-core trend in GPUs leads to various challenges, including higher static power consumption and lower and imbalanced utilization of the datapath components of a big core. As we show in this paper, two key problems ensue: (1) the lower and imbalanced datapath utilization can waste power as an application does not always utilize all portions of the big core datapath, and (2) the use of big cores can lead to application performance degradation in some cases due to the higher memory system contention caused by the larger number of memory requests generated by each big core. This paper introduces a new analysis of datapath component utilization in big-core GPUs based on queuing theory principles. Building on this analysis, we introduce a fine-grained dynamic power- and clock-gating mechanism for the entire datapath, called μC-States, which aims to minimize power consumption by turning off or tuning down datapath components that are not bottlenecks for the performance of the running application. Our experimental evaluation demonstrates that μC-States significantly reduces both static and dynamic power consumption in a big-core GPU, while also significantly improving the performance of applications affected by high memory system contention. 
We also show that our analysis of datapath component utilization can guide scheduling and design decisions in a GPU architecture that contains heterogeneous cores.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"91 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129296061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
Integrating algorithmic parameters into benchmarking and design space exploration in 3D scene understanding
Sreekar Shenoy, Bruno Bodin, Luigi Nardi, M. Zia, Harry Wagstaff, Govind Sreekar Shenoy, M. Emani, John Mawer, Christos Kotselidis, A. Nisbet, M. Luján, Björn Franke, P. Kelly, M. O’Boyle
{"title":"Integrating algorithmic parameters into benchmarking and design space exploration in 3D scene understanding","authors":"Sreekar Shenoy, Bruno Bodin, Luigi Nardi, M. Zia, Harry Wagstaff, Govind Sreekar Shenoy, M. Emani, John Mawer, Christos Kotselidis, A. Nisbet, M. Luján, Björn Franke, P. Kelly, M. O’Boyle","doi":"10.1145/2967938.2967963","DOIUrl":"https://doi.org/10.1145/2967938.2967963","url":null,"abstract":"System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers. We design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application could only run in real time on desktop GPUs. In this work, we examine how it can be mapped to power-constrained embedded systems. Key to our approach is the idea of incremental co-design exploration, where optimization choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices. The goal of this exploration is to reduce execution time while minimizing power and meeting our quality of result objective. As the design space is too large to exhaustively evaluate, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. 
This is a 4.8× execution time improvement and a 2.8× power reduction compared to the state-of-the-art.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132314023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 45
POSTER: An integrated vector-scalar design on an in-order ARM core
Milan Stanic, Oscar Palomar, Timothy Hayes, Ivan Ratković, O. Unsal, A. Cristal, M. Valero
{"title":"POSTER: An integrated vector-scalar design on an in-order ARM core","authors":"Milan Stanic, Oscar Palomar, Timothy Hayes, Ivan Ratković, O. Unsal, A. Cristal, M. Valero","doi":"10.1145/2967938.2974057","DOIUrl":"https://doi.org/10.1145/2967938.2974057","url":null,"abstract":"In the low-end mobile processor market, power, energy and area budgets are significantly lower than in other markets (e.g. servers or high-end mobile markets). It has been shown that vector processors are a highly energy-efficient way to increase performance; however adding support for them incurs area and power overheads that would not be acceptable for low-end mobile processors. In this work, we propose an integrated vector-scalar design for the ARM architecture that mostly reuses scalar hardware to support the execution of vector instructions. The key element of the design is our proposed block-based model of execution that groups vector computational instructions together to execute them in a coordinated manner.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131522365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Sparso: Context-driven optimizations of sparse linear algebra
Hongbo Rong, Jongsoo Park, Lingxiang Xiang, T. A. Anderson, M. Smelyanskiy
{"title":"Sparso: Context-driven optimizations of sparse linear algebra","authors":"Hongbo Rong, Jongsoo Park, Lingxiang Xiang, T. A. Anderson, M. Smelyanskiy","doi":"10.1145/2967938.2967943","DOIUrl":"https://doi.org/10.1145/2967938.2967943","url":null,"abstract":"The sparse matrix is a key data structure in various domains such as high-performance computing, machine learning, and graph analytics. To maximize performance of sparse matrix operations, it is especially important to optimize across the operations and not just within individual operations. While a straightforward per-operation mapping to library routines misses optimization opportunities, manually optimizing across the boundary of library routines is time-consuming and error-prone, sacrificing productivity. This paper introduces Sparso, a framework that automates such optimizations, enabling both high performance and high productivity. In Sparso, a compiler and sparse linear algebra libraries collaboratively discover and exploit context, which we define as the invariant properties of matrices and relationships between them in a program. We present compiler analyses, namely collective reordering analysis and matrix property discovery, to discover the context. The context discovered from these analyses drives key optimizations across library routines and matrices. We have implemented Sparso with the Julia language, Intel MKL and SpMP libraries. We evaluate our context-driven optimizations in 6 representative sparse linear algebra algorithms. Compared with a baseline that invokes high-performance libraries without context optimizations, Sparso results in 1.2~17x (average 5.7x) speedups. 
Our approach of compiler-library collaboration and context-driven optimizations should also be applicable to other productivity languages such as Matlab, Python, and R.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121307784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Resource conscious reuse-driven tiling for GPUs
P. Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, L. Pouchet, A. Rountev, P. Sadayappan
{"title":"Resource conscious reuse-driven tiling for GPUs","authors":"P. Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, L. Pouchet, A. Rountev, P. Sadayappan","doi":"10.1145/2967938.2967967","DOIUrl":"https://doi.org/10.1145/2967938.2967967","url":null,"abstract":"Computations involving successive application of 3D stencil operators are widely used in many application domains, such as image processing, computational electromagnetics, seismic processing, and climate modeling. Enhancement of temporal and spatial locality via tiling is generally required in order to overcome performance bottlenecks due to limited bandwidth to global memory on GPUs. However, the low shared memory capacity on current GPU architectures makes effective tiling for 3D stencils very challenging - several previous domain-specific compilers for stencils have demonstrated very high performance for 2D stencils, but much lower performance on 3D stencils. In this paper, we develop an effective resource-constraint-driven approach for automated GPU code generation for stencils. We present a fusion technique that judiciously fuses stencil computations to minimize data movement, while controlling computational redundancy and maximizing resource usage. The fusion model subsumes time tiling of iterated stencils, and can be easily adapted to different GPU architectures. We integrate the fusion model into a code generator that makes effective use of scarce shared memory and registers to achieve high performance. 
The effectiveness of the automated model-driven code generator is demonstrated through experimental results on a number of benchmarks, comparing against various previously developed GPU code generators.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130137581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
Power tuning HPC jobs on power-constrained systems
Neha Gholkar, F. Mueller, B. Rountree
{"title":"Power tuning HPC jobs on power-constrained systems","authors":"Neha Gholkar, F. Mueller, B. Rountree","doi":"10.1145/2967938.2967961","DOIUrl":"https://doi.org/10.1145/2967938.2967961","url":null,"abstract":"As we approach the exascale era, power has become a primary bottleneck. The US Department of Energy has set a power constraint of 20 MW on each exascale machine. To be able to achieve one exaflop under this constraint, it is necessary that we use power intelligently to maximize performance under a power constraint. Most production-level parallel applications that run on a supercomputer are tightly-coupled parallel applications. A naïve approach of enforcing a power constraint for a parallel job would be to divide the job's power budget uniformly across all the processors. However, previous work has shown that a power-capped job suffers from performance variation of otherwise identical processors leading to overall sub-optimal performance. We propose a 2-level hierarchical variation-aware approach of managing power at machine-level. At the macro level, PPartition partitions a machine's power budget across jobs to assign a power budget to each job running on the system such that the machine never exceeds its power budget. At the micro level, PTune makes job-centric decisions by taking the performance variation into account. For every moldable job, PTune determines the optimal number of processors, the selection of processors and the distribution of the job's power budget across them, with the goal of maximizing the job's performance under its power budget. Experiments show that, at the micro level, PTune achieves a performance improvement of up to 29% compared to a naïve approach. PTune does not lead to any performance degradation, yet frees up almost 40% of the processors for the same performance as that of the naïve approach under a hard power bound. 
At the macro level, PPartition is able to achieve a throughput improvement of 5-35% compared to uniform power distribution.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116053232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 75
POSTER - Firestorm: Operating systems for power-constrained architectures
S. Panneerselvam, M. Swift
{"title":"POSTER - Firestorm: Operating systems for power-constrained architectures","authors":"S. Panneerselvam, M. Swift","doi":"10.1145/2967938.2975607","DOIUrl":"https://doi.org/10.1145/2967938.2975607","url":null,"abstract":"Moore's law paved the way for doubling the transistors in the same chip area with every generation. However, with the end of Dennard scaling, voltage and hence the power draw of transistors is no longer dropping proportionally to size. As a result, modern processors cannot use all parts of the processor simultaneously without exceeding the power limit. This manifests as an increasing proportion of dark silicon [4]. In other words, the compute capacity of current and future processors is and will be over-provisioned with respect to the available power. Power limits are influenced by different factors such as the capacity of power distribution infrastructure, battery supply limits, and the thermal capacity of the system. Power limits in datacenters can arise from underprovisioning power distribution units relative to peak power draw. Energy limits are also dictated by the limited capacity of batteries. However, in many systems, the primary limit comes not from the ability to acquire power, but instead from the ability to dissipate power as heat once it has been used. Thermal limits are dictated by the physical properties of the processor materials and also the comfort of the user. Thus, power is limited to prevent processor chips from overheating, which can lead to thermal breakdown. As a result, the maximum performance of a system is limited by its cooling capacity, which determines its ability to dissipate heat. 
Cooling capacity varies across the computing landscape, from servers with external chilled air to desktops with large fans to laptops to fan-less mobile devices.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"167 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132092687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Student research poster: A scalable general purpose system for large-scale graph processing
Jiawen Sun
{"title":"Student research poster: A scalable general purpose system for large-scale graph processing","authors":"Jiawen Sun","doi":"10.1145/2967938.2971465","DOIUrl":"https://doi.org/10.1145/2967938.2971465","url":null,"abstract":"Graph analytics is an important and computationally demanding class of data analytics. It is essential to balance scalability, ease-of-use and high performance in large scale graph analytics. As such, it is necessary to hide the complexity of parallelism, data distribution and memory locality behind an abstract interface [2].","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"8 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127102084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2