{"title":"Evaluating the Cost of Atomic Operations on Modern Architectures","authors":"Hermann Schweizer, Maciej Besta, T. Hoefler","doi":"10.1109/PACT.2015.24","DOIUrl":"https://doi.org/10.1109/PACT.2015.24","url":null,"abstract":"Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"86 26 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126290260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vector Parallelism in JavaScript: Language and Compiler Support for SIMD","authors":"Ivan Jibaja, P. Jensen, Ningxin Hu, M. Haghighat, J. McCutchan, D. Gohman, S. Blackburn, K. McKinley","doi":"10.1109/PACT.2015.33","DOIUrl":"https://doi.org/10.1109/PACT.2015.33","url":null,"abstract":"JavaScript is the most widely used web programming language and is increasingly used to implement sophisticated and demanding applications in such domains as graphics, games, video, and cryptography. The performance and energy usage of these applications can benefit from hardware parallelism, including SIMD (Single Instruction, Multiple Data) vector parallel instructions. JavaScript's current support for parallelism is, however, limited and does not directly utilize SIMD capabilities. This paper presents the design and implementation of SIMD language extensions and compiler support that together add fine-grain vector parallelism to JavaScript. The specification for this language extension is in final stages of adoption by the JavaScript standardization committee and our compiler support is available in two open-source production browsers. The design principles seek portability, SIMD performance portability on various SIMD architectures, and compiler simplicity to ease adoption. The design does not require automatic vectorization compiler technology, but does not preclude it either. The SIMD extensions define immutable fixed-length SIMD data types and operations that are common to both ARM and x86 ISAs. The contributions of this work include type speculation and optimizations that generate minimal numbers of SIMD native instructions from high-level JavaScript SIMD instructions. We implement type speculation, optimizations, and code generation in two open-source JavaScript VMs and measure performance improvements between a factor of 1.7× to 8.9× with an average of 3.3× and average energy improvements of 2.9× on micro benchmarks and key graphics kernels on various hardware, browsers, and operating systems. These portable SIMD language extensions significantly improve compute-intensive interactive applications in the browser, such as games and media processing, by exploiting vector parallelism without relying on automatic vectorizing compiler technology, non-portable native code, or plugins.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131062174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems","authors":"K. Murthy, J. Mellor-Crummey","doi":"10.1109/PACT.2015.41","DOIUrl":"https://doi.org/10.1109/PACT.2015.41","url":null,"abstract":"Data movement is a critical bottleneck for future generations of parallel systems. The class of .5D communication-avoiding algorithms were developed to address this bottleneck. These algorithms reduce communication and provide strong scaling in both time and energy. As a firststep towards automating the development of communication-avoiding-libraries, we developed the Maunam compiler. Maunam generates efficient parallel code from a high-level, global view sketch of .5D algorithms that are expressed using symbolic data sizes and numbers of processors. It supports the expression of data movement and communication through-high-level global operations such as TILT and CSHIFT as well as through element-wise copy operations. With the latter, wrap around communication patterns can also be achieved using subscripts based on modulo operations. Maunam employs polyhedral analysis to reason about communication and computation present in a .5D algorithm. After partitioning data and computation, it inserts point-to-point-and collective communication as needed. Maunam also analyzes data dependence patterns and data layouts to identify reductions over processor subsets. Maunam-generated Fortran+MPI code for 2.5D matrix multiplication running on 4096 cores of a Cray XC30 super computer achieves 59 TFlops/s (76% of the machine peak). Our generated parallel code achieves 91% of the performance of a hand-coded version.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132323424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Storage Consolidation on SSDs: Not Always a Panacea, but Can We Ease the Pain?","authors":"Narges Shahidi, Anand Sivasubramanian, M. Kandemir, C. Das","doi":"10.1109/PACT.2015.61","DOIUrl":"https://doi.org/10.1109/PACT.2015.61","url":null,"abstract":"Storage Consolidation is increasing being adopted to reduce system costs, simplify the storage infrastructure, and enhance availability and resource management. However, consolidation leads to interference in shared resources. This poster shows the effect of consolidation on performance of co-located applications and proposes a static approach to reduce interference.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114434935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical Near-Data Processing for In-Memory Analytics Frameworks","authors":"Mingyu Gao, Grant Ayers, C. Kozyrakis","doi":"10.1109/PACT.2015.22","DOIUrl":"https://doi.org/10.1109/PACT.2015.22","url":null,"abstract":"The end of Dennard scaling has made all systemsenergy-constrained. For data-intensive applications with limitedtemporal locality, the major energy bottleneck is data movementbetween processor chips and main memory modules. For such workloads, the best way to optimize energy is to place processing near the datain main memory. Advances in 3D integrationprovide an opportunity to implement near-data processing (NDP) withoutthe technology problems that similar efforts had in the past. This paper develops the hardware and software of an NDP architecturefor in-memory analytics frameworks, including MapReduce, graphprocessing, and deep neural networks. We develop simple but scalablehardware support for coherence, communication, and synchronization, anda runtime system that is sufficient to support analytics frameworks withcomplex data patterns while hiding all thedetails of the NDP hardware. Our NDP architecture provides up to 16x performance and energy advantageover conventional approaches, and 2.5x over recently-proposed NDP systems. We also investigate the balance between processing and memory throughput, as well as the scalability and physical and logical organization of the memory system. Finally, we show that it is critical to optimize software frameworksfor spatial locality as it leads to 2.9x efficiency improvements for NDP.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116964144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Vectorization Approach to Nested Thread-level Parallelism for CUDA GPUs","authors":"Shixiong Xu, David Gregg","doi":"10.1109/PACT.2015.56","DOIUrl":"https://doi.org/10.1109/PACT.2015.56","url":null,"abstract":"Nested thread-level parallelism (TLP) is pervasive in real applications. For example, 75% (14 out of 19) of the applications in the Rodinia benchmark for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping the enclosed nested parallelism to the GPU threads in the C-to-CUDA compilation (OpenACC in this paper) is becoming more and more important. This mapping problem is two folds: suitable execution models and efficient mapping strategies of the nested parallelism.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"185 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123292256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Compiler Techniques to Improve Automatic Performance Modeling","authors":"Arnamoy Bhattacharyya, Grzegorz Kwasniewski, T. Hoefler","doi":"10.1109/PACT.2015.39","DOIUrl":"https://doi.org/10.1109/PACT.2015.39","url":null,"abstract":"Performance modeling can be utilized in a number of scenarios, starting from finding performance bugs to the scalability study of applications. Existing dynamic and static approaches for automating the generation of performance models have limitations for precision and overhead. In this work, we explore combination of a number of static and dynamic analyses for life-long performance modeling and investigate accuracy, reduction of the model search space, and performance improvements over previous approaches on a wide range of parallel benchmarks. We develop static and dynamic schemes such as kernel clustering, batched model updates and regulation of modeling frequency for reducing the cost of measurements, model generation, and updates. Our hybrid approach, on average can improve the accuracy of the performance models by 4.3%(maximum 10%) and can reduce the overhead by 25% (maximum 65%) as compared to previous approaches.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129129999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance","authors":"Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, G. Loh, C. Das, M. Kandemir, O. Mutlu","doi":"10.1109/PACT.2015.38","DOIUrl":"https://doi.org/10.1109/PACT.2015.38","url":null,"abstract":"In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous memory divergence behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their request miss (low cache utility). Second, a warp retains the same divergence behavior for long periods of execution. Third, due to high memory level parallelism, requests going to the shared cache can incur queuing delays as large as hundreds of cycles, exacerbating the effects of memory divergence. We propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC uses warp divergence characterization to guide three components: (1) a cache bypassing mechanism that exploits the latency tolerance of low cache utility warps to both alleviate queuing delay and increase the hit rate for high cache utility warps, (2) a cache insertion policy that prevents data from highcache utility warps from being prematurely evicted, and (3) a memory controller that prioritizes the few requests received from high cache utility warps to minimize stall time. We compare MeDiC to four cache management techniques, and find that it delivers an average speedup of 21.8%, and 20.1% higher energy efficiency, over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129945695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Efficient Hybrid DRAM/NVM Main Memory","authors":"Ahmad Hassan, H. Vandierendonck, Dimitrios S. Nikolopoulos","doi":"10.1109/PACT.2015.58","DOIUrl":"https://doi.org/10.1109/PACT.2015.58","url":null,"abstract":"DRAM consumes significant static energy both in active and idle state due to continuous leakage and refresh power. Various byte-addressable non-volatile memory (NVM) technologies promise near-zero static energy and persistence, however they suffer from increased latency and increased dynamic energy than DRAM. A hybrid main memory, containing both DRAM and NVM components, can provide both low energy and high performance although such organizations require that data is placed in the appropriate component. We propose a user-level software management methodology for a hybrid DRAM/NVM main memory system with an aim to reduce energy.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132851352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable SIMD-Efficient Graph Processing on GPUs","authors":"Farzad Khorasani, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2015.15","DOIUrl":"https://doi.org/10.1109/PACT.2015.15","url":null,"abstract":"The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power law graphs makes effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works can either maximize graph sizes (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation, however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex refinement employs parallel binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs' outbox buffers for dramatically reducing inter-GPU data transfer volume. Whereas existing multi-GPU techniques (Medusa, TOTEM) perform high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115305645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}