Tiling and optimizing time-iterated computations over periodic domains
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, G. Potron, Nicolas Vasilache
2014 23rd International Conference on Parallel Architecture and Compilation (PACT). DOI: 10.1145/2628071.2628106

This paper deals with optimizing time-iterated computations on periodic data domains. These computations are prevalent in computational sciences, particularly in partial differential equation solvers. We propose a fully automatic technique suitable for implementation in a compiler or in a domain-specific code generator for such computations. Dependence patterns on periodic data domains prevent existing algorithms from finding tiling opportunities. Our approach augments a state-of-the-art parallelization and locality-enhancing algorithm from the polyhedral framework to allow time-tiling of stencil computations on periodic domains. Experimental results on the swim SPEC CPU2000fp benchmark show speedups of 5× and 4.2× over the highest SPEC performance achieved by native compilers on Intel Xeon and AMD Opteron multicore SMP systems, respectively. On other representative stencil computations, our scheme provides performance similar to that achieved with no periodicity, and a very high speedup is obtained over the native compiler. We also report a mean speedup of about 1.5× over a domain-specific stencil compiler supporting limited cases of periodic boundary conditions. To the best of our knowledge, it has been infeasible to manually reproduce such optimizations on swim or any other periodic stencil, especially on data grids of two dimensions or higher.
{"title":"XStream: Cross-core spatial streaming based MLC prefetchers for parallel applications in CMPs","authors":"Biswabandan Panda, S. Balachandran","doi":"10.1145/2628071.2628073","DOIUrl":"https://doi.org/10.1145/2628071.2628073","url":null,"abstract":"Hardware prefetchers are commonly used to hide and tolerate off-chip memory latency. Prefetching techniques in the literature are designed for multiple independent sequential applications running on a multicore system. In contrast to multiple independent applications, a single parallel application running on a multicore system exhibits different behavior. In case of a parallel application, cores share and communicate data and code among themselves, and there is commonality in the demand miss streams across multiple cores. This gives an opportunity to predict the demand miss streams and communicate the predicted streams from one core to another, which we refer as cross-core stream communication. We propose cross-core spatial streaming (XStream), a practical and storage-efficient cross-core prefetching technique. XStream detects and predicts the cross-core spatial streams at the private mid level caches (MLCs) and sends the predicted streams in advance to MLC prefetchers of the predicted cores. We compare the effectiveness of XStream with the ideal cross-core spatial streamer. Experimental results demonstrate that, on an average (geomean), compared to the state-of-the-art spatial memory streaming, storage efficient XStream reduces the execution time by 11.3% (as high as 24%) and 9% (as high as 29.09%) for 4-core and 8-core systems respectively.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116269845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stratified sampling for even workload partitioning","authors":"Jeeva Paudel, J. N. Amaral","doi":"10.1145/2628071.2671422","DOIUrl":"https://doi.org/10.1145/2628071.2671422","url":null,"abstract":"This work presents a novel algorithm, Workload Partitioning and Scheduling (WPS), for evenly partitioning the computational workload of large implicitly-defined work-list based applications on distributed/shared-memory systems. WPS uses stratified sampling to estimate the number of work items that will be processed in each step of an application. WPS uses such estimation to evenly partition and distribute the computational workload. An empirical evaluation on large applications — Iterative-Deepening A∗ (IDA∗) applied to (4×4)-Sliding-Tile Puzzles, Delaunay Mesh Generation, and Delaunay Mesh Refinement — shows that WPS is applicable to a range of problems, and yields 28% to 49% speedups over existing work-stealing schedulers alone.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"406 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116285837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCS: Runtime resource and core scaling for power-constrained multi-core processors","authors":"H. Ghasemi, N. Kim","doi":"10.1145/2628071.2628095","DOIUrl":"https://doi.org/10.1145/2628071.2628095","url":null,"abstract":"Providing a sufficient voltage/frequency (V/F) scaling range is critical for effective power management. However, it has been fraught with decreasing nominal operating voltage and increasing manufacturing process variability that makes it harder to scale the minimum operating voltage (VMIN). In this paper, we first present a resource and core scaling (RCS) technique that jointly scales (i) the resources of a processor and (ii) the number of operating cores to maximize the performance of power-constrained multi-core processors. More specifically, we uniformly scale the resources that are both associated with each core (e.g., L1 caches and execution units (EUs)) and shared by all the cores (e.g., last-level cache (LLC)) as a means to compensate for lack of a V/F scaling range. Under the maximum power constraint, disabling some resources allows us to increase the number of operating cores, and vice versa. We demonstrate that the best RCS configuration for a given application can improve the geometric-mean performance by 21%. Second, we propose a runtime system that predicts the best RCS configuration for a given application and adapts the processor configuration accordingly at runtime. The runtime system only needs to examine a small fraction of runtime to predict the best RCS configuration with accuracy well over 90%, whereas the runtime overhead of prediction and adaptation is small. Finally, we propose to selectively scale the resources in RCS (dubbed sRCS) depending on application's characteristics and demonstrate that sRCS can offer 6% higher geometric-mean performance than RCS that uniformly scales the resources.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123303580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PEMOGEN: Automatic adaptive performance modeling during program runtime","authors":"Arnamoy Bhattacharyya, T. Hoefler","doi":"10.1145/2628071.2628100","DOIUrl":"https://doi.org/10.1145/2628071.2628100","url":null,"abstract":"Traditional means of gathering performance data are tracing, which is limited by the available storage, and profiling, which has limited accuracy. Performance modeling is often used to interpret the tracing data and generate performance predictions. We aim to complement the traditional data collection mechanisms with online performance modeling, a method that generates performance models while the application is running. This allows us to greatly reduce the storage overhead while still producing accurate predictions. We present PEMOGEN, our compilation and modeling framework that automatically instruments applications to generate performance models during program execution. We demonstrate the ability of PEMOGEN to both reduce storage cost and improve the prediction accuracy compared to traditional techniques such as least squares fitting. With our tool, we automatically detect 3,370 kernels from fifteen NAS and Mantevo applications and model their execution time with a median coefficient of variation (R2) of 0.81. These automatically generated performance models can be used to quickly assess the scaling and potential bottlenecks with regards to any input parameter and the number of processes of a parallel application.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122994439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory scheduling towards high-throughput cooperative heterogeneous computing","authors":"Hao Wang, Ripudaman Singh, M. Schulte, N. Kim","doi":"10.1145/2628071.2628096","DOIUrl":"https://doi.org/10.1145/2628071.2628096","url":null,"abstract":"Technology scaling enables the integration of both the CPU and the GPU into a single chip for higher throughput and energy efficiency. In such a single-chip heterogeneous processor (SCHP), its memory bandwidth is the most critically shared resource, requiring judicious management to maximize the throughput. Previous studies on memory scheduling for SCHPs have focused on the scenario where multiple applications are running on the CPU and the GPU respectively, which we denote as a multitasking scenario. However, another increasingly important usage scenario for SCHPs is cooperative heterogeneous computing, where a single parallel application is partitioned between the CPU and the GPU such that the overall throughput is maximized. In previous studies on memory scheduling techniques for chip multi-processors (CMPs) and SCHPs, the first-ready first-come-first-service (FR-FCFS) scheduling policy was used as an inept baseline due to its fairness issue. However, in a cooperative heterogeneous computing scenario, we first demonstrate that FR-FCFS actually offers nearly 10% higher throughput than two recently proposed memory scheduling techniques designed for a multi-tasking scenario. Second, based on our analysis on memory access characteristics in a cooperative heterogeneous computing scenario, we propose various optimization techniques that enhance the row-buffer locality by 10%, reduce the service latency of CPU memory requests by 26%, and improve the overall throughput by up to 8% compared to FR-FCFS.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129023354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data remapping for an energy efficient burst chop in DRAM memory systems
Sudharsan Jagathrakshakan, Venkata Kalyan Tavva, M. Mutyam
PACT 2014. DOI: 10.1145/2628071.2671424

In modern systems, main memory contributes significantly to overall power consumption. One of the features provided from the JEDEC DDR3 standard onwards is burst chop (BC), through which the burst length of the data access (CAS) commands can be configured. This work aims to improve the energy efficiency of DRAM memory by exploiting the existing BC feature for half writes (writes in which either the first half or the second half of the cache block is dirty). We propose to change the mapping of the words of a cache block to the DRAM devices in order to reduce the number of devices involved in half writes. With our new mapping, we achieve average memory power savings of 3.27% with negligible impact on performance.
Compiler support for selective page migration in NUMA architectures
G. Piccoli, Henrique Nazaré, R. E. Rodrigues, C. Pousa, E. Borin, Fernando Magno Quintão Pereira
PACT 2014. DOI: 10.1145/2628071.2628077

Current high-performance multicore processors present users with a non-uniform memory access (NUMA) model. These systems perform better when threads access data on memory banks next to the cores where they run; however, ensuring data locality is difficult. In this paper, we propose compiler analyses and code generation methods to support a lightweight runtime system that dynamically migrates memory pages to improve data locality. Our technique combines static and dynamic analyses and is capable of identifying the most promising pages to migrate. Statically, we infer the size of arrays plus the amount of reuse of each memory access instruction in a program; these estimates rely on a simple, yet accurate, trip-count predictor of our own design. This knowledge lets us build templates of dynamic checks, to be filled with values known only at runtime, which determine when it is profitable to migrate data closer to the processors that use it. Our static analyses are quadratic in the number of variables in a program, and the dynamic checks are O(1) in practice. Our technique requires no user intervention, no third-party middleware, and no modifications to the operating system's kernel. We have applied our technique to several parallel algorithms that are completely oblivious to the asymmetric memory topology, and we have observed speedups of up to 4× compared to static heuristics. We also compare our approach against Minas, a middleware that supports NUMA-aware data allocation, and show that we can outperform it by up to 50% in some cases.
{"title":"Versatile and scalable parallel histogram construction","authors":"Wookeun Jung, Jongsoo Park, Jaejin Lee","doi":"10.1145/2628071.2628108","DOIUrl":"https://doi.org/10.1145/2628071.2628108","url":null,"abstract":"Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize abundant parallel resources in modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) or architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achiev competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and are scalable for more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi coprocessors for common input data sets because of their compute power from many cores and instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histogram, or word count, the Xeon processor achieves 342.4 million words per seconds (MWPS). This is 4.12×, 3.46× faster than PHOENIX and TBB. The Xeon phi processor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116011728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring flexibility in single-ISA heterogeneous processors","authors":"Erik Tomusk, Christophe Dubach, M. O’Boyle","doi":"10.1145/2628071.2628125","DOIUrl":"https://doi.org/10.1145/2628071.2628125","url":null,"abstract":"Single-ISA heterogeneous processors are a promising method for enabling runtime power flexibility. Low-priority programs run on low-power cores, and high-priority programs run on high-power cores. In recent years, a number of methods for heterogeneous design space exploration have emerged. These methods search the design space for Pareto frontiers of cores that are optimal for power and speed. We demonstrate that a heterogeneous processor cannot be composed by simply selecting some cores from a Pareto-optimal set; the selection must give even coverage of the design space. We then define a metric — clumpiness — for measuring how well selected heterogeneous cores cover the design space.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116139538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}