Tiling and optimizing time-iterated computations over periodic domains
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, G. Potron, Nicolas Vasilache
2014 23rd International Conference on Parallel Architecture and Compilation (PACT). DOI: 10.1145/2628071.2628106

This paper deals with optimizing time-iterated computations on periodic data domains. These computations are prevalent in computational sciences, particularly in partial differential equation solvers. We propose a fully automatic technique suitable for implementation in a compiler or in a domain-specific code generator for such computations. Dependence patterns on periodic data domains prevent existing algorithms from finding tiling opportunities. Our approach augments a state-of-the-art parallelization and locality-enhancing algorithm from the polyhedral framework to allow time-tiling of stencil computations on periodic domains. Experimental results on the swim SPEC CPU2000fp benchmark show speedups of 5× and 4.2× over the highest SPEC performance achieved by native compilers on Intel Xeon and AMD Opteron multicore SMP systems, respectively. On other representative stencil computations, our scheme provides performance similar to that achieved with no periodicity, and a very high speedup is obtained over the native compiler. We also report a mean speedup of about 1.5× over a domain-specific stencil compiler supporting limited cases of periodic boundary conditions. To the best of our knowledge, it has been infeasible to manually reproduce such optimizations on swim or any other periodic stencil, especially on data grids of two dimensions or higher.
{"title":"XStream: Cross-core spatial streaming based MLC prefetchers for parallel applications in CMPs","authors":"Biswabandan Panda, S. Balachandran","doi":"10.1145/2628071.2628073","DOIUrl":"https://doi.org/10.1145/2628071.2628073","url":null,"abstract":"Hardware prefetchers are commonly used to hide and tolerate off-chip memory latency. Prefetching techniques in the literature are designed for multiple independent sequential applications running on a multicore system. In contrast to multiple independent applications, a single parallel application running on a multicore system exhibits different behavior. In case of a parallel application, cores share and communicate data and code among themselves, and there is commonality in the demand miss streams across multiple cores. This gives an opportunity to predict the demand miss streams and communicate the predicted streams from one core to another, which we refer as cross-core stream communication. We propose cross-core spatial streaming (XStream), a practical and storage-efficient cross-core prefetching technique. XStream detects and predicts the cross-core spatial streams at the private mid level caches (MLCs) and sends the predicted streams in advance to MLC prefetchers of the predicted cores. We compare the effectiveness of XStream with the ideal cross-core spatial streamer. Experimental results demonstrate that, on an average (geomean), compared to the state-of-the-art spatial memory streaming, storage efficient XStream reduces the execution time by 11.3% (as high as 24%) and 9% (as high as 29.09%) for 4-core and 8-core systems respectively.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116269845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stratified sampling for even workload partitioning","authors":"Jeeva Paudel, J. N. Amaral","doi":"10.1145/2628071.2671422","DOIUrl":"https://doi.org/10.1145/2628071.2671422","url":null,"abstract":"This work presents a novel algorithm, Workload Partitioning and Scheduling (WPS), for evenly partitioning the computational workload of large implicitly-defined work-list based applications on distributed/shared-memory systems. WPS uses stratified sampling to estimate the number of work items that will be processed in each step of an application. WPS uses such estimation to evenly partition and distribute the computational workload. An empirical evaluation on large applications — Iterative-Deepening A∗ (IDA∗) applied to (4×4)-Sliding-Tile Puzzles, Delaunay Mesh Generation, and Delaunay Mesh Refinement — shows that WPS is applicable to a range of problems, and yields 28% to 49% speedups over existing work-stealing schedulers alone.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"406 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116285837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCS: Runtime resource and core scaling for power-constrained multi-core processors","authors":"H. Ghasemi, N. Kim","doi":"10.1145/2628071.2628095","DOIUrl":"https://doi.org/10.1145/2628071.2628095","url":null,"abstract":"Providing a sufficient voltage/frequency (V/F) scaling range is critical for effective power management. However, it has been fraught with decreasing nominal operating voltage and increasing manufacturing process variability that makes it harder to scale the minimum operating voltage (VMIN). In this paper, we first present a resource and core scaling (RCS) technique that jointly scales (i) the resources of a processor and (ii) the number of operating cores to maximize the performance of power-constrained multi-core processors. More specifically, we uniformly scale the resources that are both associated with each core (e.g., L1 caches and execution units (EUs)) and shared by all the cores (e.g., last-level cache (LLC)) as a means to compensate for lack of a V/F scaling range. Under the maximum power constraint, disabling some resources allows us to increase the number of operating cores, and vice versa. We demonstrate that the best RCS configuration for a given application can improve the geometric-mean performance by 21%. Second, we propose a runtime system that predicts the best RCS configuration for a given application and adapts the processor configuration accordingly at runtime. The runtime system only needs to examine a small fraction of runtime to predict the best RCS configuration with accuracy well over 90%, whereas the runtime overhead of prediction and adaptation is small. Finally, we propose to selectively scale the resources in RCS (dubbed sRCS) depending on application's characteristics and demonstrate that sRCS can offer 6% higher geometric-mean performance than RCS that uniformly scales the resources.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123303580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PEMOGEN: Automatic adaptive performance modeling during program runtime","authors":"Arnamoy Bhattacharyya, T. Hoefler","doi":"10.1145/2628071.2628100","DOIUrl":"https://doi.org/10.1145/2628071.2628100","url":null,"abstract":"Traditional means of gathering performance data are tracing, which is limited by the available storage, and profiling, which has limited accuracy. Performance modeling is often used to interpret the tracing data and generate performance predictions. We aim to complement the traditional data collection mechanisms with online performance modeling, a method that generates performance models while the application is running. This allows us to greatly reduce the storage overhead while still producing accurate predictions. We present PEMOGEN, our compilation and modeling framework that automatically instruments applications to generate performance models during program execution. We demonstrate the ability of PEMOGEN to both reduce storage cost and improve the prediction accuracy compared to traditional techniques such as least squares fitting. With our tool, we automatically detect 3,370 kernels from fifteen NAS and Mantevo applications and model their execution time with a median coefficient of variation (R2) of 0.81. These automatically generated performance models can be used to quickly assess the scaling and potential bottlenecks with regards to any input parameter and the number of processes of a parallel application.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122994439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory scheduling towards high-throughput cooperative heterogeneous computing","authors":"Hao Wang, Ripudaman Singh, M. Schulte, N. Kim","doi":"10.1145/2628071.2628096","DOIUrl":"https://doi.org/10.1145/2628071.2628096","url":null,"abstract":"Technology scaling enables the integration of both the CPU and the GPU into a single chip for higher throughput and energy efficiency. In such a single-chip heterogeneous processor (SCHP), its memory bandwidth is the most critically shared resource, requiring judicious management to maximize the throughput. Previous studies on memory scheduling for SCHPs have focused on the scenario where multiple applications are running on the CPU and the GPU respectively, which we denote as a multitasking scenario. However, another increasingly important usage scenario for SCHPs is cooperative heterogeneous computing, where a single parallel application is partitioned between the CPU and the GPU such that the overall throughput is maximized. In previous studies on memory scheduling techniques for chip multi-processors (CMPs) and SCHPs, the first-ready first-come-first-service (FR-FCFS) scheduling policy was used as an inept baseline due to its fairness issue. However, in a cooperative heterogeneous computing scenario, we first demonstrate that FR-FCFS actually offers nearly 10% higher throughput than two recently proposed memory scheduling techniques designed for a multi-tasking scenario. Second, based on our analysis on memory access characteristics in a cooperative heterogeneous computing scenario, we propose various optimization techniques that enhance the row-buffer locality by 10%, reduce the service latency of CPU memory requests by 26%, and improve the overall throughput by up to 8% compared to FR-FCFS.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129023354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data remapping for an energy efficient burst chop in DRAM memory systems
Sudharsan Jagathrakshakan, Venkata Kalyan Tavva, M. Mutyam
PACT 2014. DOI: 10.1145/2628071.2671424

In modern systems, main memory contributes significantly to overall power consumption. One of the features provided from the JEDEC DDR3 standard onwards is burst chop (BC), through which the burst length of the data access (CAS) commands can be configured. This work aims to improve the energy efficiency of DRAM memory by exploiting the existing BC feature for half writes (writes in which either the first half or the second half of the cache block is dirty). We propose to change the mapping of the words of a cache block to the DRAM devices in order to reduce the number of devices involved in half writes. With our new mapping, we achieve average memory power savings of 3.27% with negligible impact on performance.
Compiler support for selective page migration in NUMA architectures
G. Piccoli, Henrique Nazaré, R. E. Rodrigues, C. Pousa, E. Borin, Fernando Magno Quintão Pereira
PACT 2014. DOI: 10.1145/2628071.2628077

Current high-performance multicore processors present users with a non-uniform memory access (NUMA) model. These systems perform better when threads access data on memory banks next to the cores where they run; however, ensuring data locality is difficult. In this paper, we propose compiler analyses and code generation methods to support a lightweight runtime system that dynamically migrates memory pages to improve data locality. Our technique combines static and dynamic analyses and is capable of identifying the most promising pages to migrate. Statically, we infer the size of arrays plus the amount of reuse of each memory access instruction in a program; these estimates rely on a simple, yet accurate, trip-count predictor of our own design. This knowledge lets us build templates of dynamic checks, to be filled with values known only at runtime, which determine when it is profitable to migrate data closer to the processors that use it. Our static analyses are quadratic in the number of variables in a program, and the dynamic checks are O(1) in practice. Our technique requires no user intervention, no third-party middleware, and no modifications to the operating system's kernel. We have applied our technique to several parallel algorithms that are completely oblivious to the asymmetric memory topology, and we have observed speedups of up to 4× compared to static heuristics. We also compare our approach against Minas, a middleware that supports NUMA-aware data allocation, and show that we can outperform it by up to 50% in some cases.
{"title":"Versatile and scalable parallel histogram construction","authors":"Wookeun Jung, Jongsoo Park, Jaejin Lee","doi":"10.1145/2628071.2628108","DOIUrl":"https://doi.org/10.1145/2628071.2628108","url":null,"abstract":"Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize abundant parallel resources in modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) or architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achiev competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and are scalable for more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi coprocessors for common input data sets because of their compute power from many cores and instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histogram, or word count, the Xeon processor achieves 342.4 million words per seconds (MWPS). This is 4.12×, 3.46× faster than PHOENIX and TBB. The Xeon phi processor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116011728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring flexibility in single-ISA heterogeneous processors","authors":"Erik Tomusk, Christophe Dubach, M. O’Boyle","doi":"10.1145/2628071.2628125","DOIUrl":"https://doi.org/10.1145/2628071.2628125","url":null,"abstract":"Single-ISA heterogeneous processors are a promising method for enabling runtime power flexibility. Low-priority programs run on low-power cores, and high-priority programs run on high-power cores. In recent years, a number of methods for heterogeneous design space exploration have emerged. These methods search the design space for Pareto frontiers of cores that are optimal for power and speed. We demonstrate that a heterogeneous processor cannot be composed by simply selecting some cores from a Pareto-optimal set; the selection must give even coverage of the design space. We then define a metric — clumpiness — for measuring how well selected heterogeneous cores cover the design space.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116139538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}