{"title":"Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters","authors":"T. Agarwal, M. Becchi","doi":"10.1145/2628071.2671423","DOIUrl":"https://doi.org/10.1145/2628071.2671423","url":null,"abstract":"In the last few years, GPUs have become an integral part of HPC clusters. To test these heterogeneous CPU-GPU systems, we designed a hybrid CUDA-MPI benchmark suite that consists of three communication- and compute-intensive applications: Matrix Multiplication (MM), Needleman-Wunsch (NW) and the ADFA compression algorithm [1]. The main goal of this work is to characterize these workloads on CPU-GPU clusters. Our benchmark applications are designed to allow cluster administrators to identify bottlenecks in the cluster, to decide if scaling applications to multiple nodes would improve or decrease overall throughput and to design effective scheduling policies. Our experiments show that inter-node communication can significantly degrade the throughput of communication-intensive applications. We conclude that the scalability of the applications depends primarily on two factors: the cluster configuration and the applications characteristics.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131170302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based performance accounting for chip multiprocessor memory systems","authors":"Magnus Jahre","doi":"10.1145/2628071.2628111","DOIUrl":"https://doi.org/10.1145/2628071.2628111","url":null,"abstract":"Chip Multiprocessor (CMP) memory systems share memory system resources between processor cores. While this sharing enables good resource utilization and fast inter-processor communication, it also makes the performance of an application depend on its co-runners. This breaks the system software assumption that a process has the same rate of progress regardless of the co-schedule, potentially leading to priority inversion, missed deadlines, unpredictable interactive performance and non-compliance with service level agreements. In this work, we present a novel graph-based technique that accurately estimates the performance an application would experience without memory system interference. Dynamic interference-free performance estimates can enable scheduling algorithms and management policies that optimize directly for system performance metrics.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131856072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EFetch: Optimizing instruction fetch for event-driven web applications","authors":"Gaurav Chadha, S. Mahlke, S. Narayanasamy","doi":"10.1145/2628071.2628103","DOIUrl":"https://doi.org/10.1145/2628071.2628103","url":null,"abstract":"Web 2.0 applications written in JavaScript are increasingly popular as they are easy to use, easy to update and maintain, and portable across a wide variety of computing platforms. Web applications receive frequent input from a rich array of sensors, network, and user input modalities. To handle the resulting asynchrony due to these inputs, web applications are developed using an event-driven programming model. These event-driven web applications have dramatically different characteristics, which provides an opportunity to create a customized processor core to improve the responsiveness of web applications. In this paper, we take one step towards creating a core customized to event-driven applications. We observe that instruction cache misses of web applications are substantially higher than conventional server and desktop workloads due to large working sets caused by distant re-use. To mitigate this bottleneck, we propose an instruction prefetcher (EFetch) that is tuned to exploit the characteristics of web applications. We find that an event signature, which captures the current event and function calling context, is a good predictor of the control flow inside a function of an event-driven program. It allows us to accurately predict a function's callees and their function bodies and prefetch them in a timely manner. For a set of real-world web applications, we show that the proposed prefetcher outperforms commonly implemented next-2-line prefetcher by 17%. Also, it consumes 5.2 times less area than a recently proposed prefetcher, while outperforming it.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124278057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From petascale to the pocket: Adaptively scaling parallel programs for mobile SoCs","authors":"Adam Fidel, N. Amato, Lawrence Rauchwerger","doi":"10.1145/2628071.2671426","DOIUrl":"https://doi.org/10.1145/2628071.2671426","url":null,"abstract":"With resource-constrained mobile and embedded devices being outfitted with multicore processors, there exists a need to allow existing parallel programs to be scaled down to efficiently utilize these devices. We study the marriage of programming models originally designed for distributed-memory supercomputers with smaller scale parallel architectures that are shared-memory and generally resource-constrained.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124895258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeSTM: Harnessing determinism in STMs for application development","authors":"K. Ravichandran, Ada Gavrilovska, S. Pande","doi":"10.1145/2628071.2628094","DOIUrl":"https://doi.org/10.1145/2628071.2628094","url":null,"abstract":"Non-determinism has long been recognized as one of the key challenges which restrict parallel programmer productivity by complicating several phases of application development. While Software Transactional Memory (STM) systems have greatly improved the productivity of programmers developing parallel applications in a host of areas they still exhibit non-deterministic behavior leading to decreased productivity. While determinism in parallel applications which use traditional synchronization primitives (such as locks) has been relatively well studied, its interplay with STMs has not. In this paper we present DeSTM, a deterministic STM, which allows programmers to leverage determinism through the implementation, debugging and testing phases of application development. In this work we first adapt techniques which introduce determinism in applications which use traditional synchronization (such as locks) to work in conjunction with certain STMs. As one would expect, this does lead to performance degradation over a non-deterministic execution. Next we present, DeSTM, which uses novel techniques exploiting the properties of these STMs to dramatically improve the performance of deterministic executions. Further, DeSTM allows programmers to randomly change the deterministic schedule in a controlled fashion giving programmers access to a wide variety of execution schedules during application development. We demonstrate our approach on the STAMP benchmark suite. We first study the overheads that determinism introduces in STM applications and then demonstrate how DeSTM is able to improve performance of deterministic execution significantly, by over 50% in some applications and on average by about 35%. DeSTM also actually helped us detect, what we believe is a bug, in one of the benchmarks. Further, our approach is programmer friendly requiring no changes to application code.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"35 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126057919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using STT-RAM to enable energy-efficient near-threshold chip multiprocessors","authors":"Xiang Pan, R. Teodorescu","doi":"10.1145/2628071.2628132","DOIUrl":"https://doi.org/10.1145/2628071.2628132","url":null,"abstract":"Near-threshold computing is gaining traction as an energy-efficient solution for power-constrained systems. This paper proposes a novel near-threshold chip multiprocessor design that uses non-volatile spin-transfer torque random access memory (STT-RAM) technology to implement all on-chip caches. This technology has several advantages over SRAM that are particularly useful in near-threshold designs. Primarily, STT-RAM has very low leakage, saving a substantial fraction of the power consumed by near-threshold chips. In addition, the STT-RAM components run at a higher supply voltage to speed up write operations. This has the effect of making cache reads very fast to the point where L1 caches can be shared by several cores, improving performance. Overall, the proposed design saves 11–33% energy compared to an SRAM-based near-threshold system.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125327993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive heterogeneous scheduling for integrated GPUs","authors":"R. Kaleem, R. Barik, T. Shpeisman, B. Lewis, Chunling Hu, K. Pingali","doi":"10.1145/2628071.2628088","DOIUrl":"https://doi.org/10.1145/2628071.2628088","url":null,"abstract":"Many processors today integrate a CPU and GPU on the same die, which allows them to share resources like physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of dataparallel kernels between the CPU and GPU without input from application developers. It does profiling on the CPU and GPU in a way that doesn't penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance via irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work on each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches primarily targeting NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing. We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th Generation Core Processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput with a CPU-and-GPU oracle that always chooses the best work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114388419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAWS: Criticality-aware warp scheduling for GPGPU workloads","authors":"Shin-Ying Lee, Carole-Jean Wu","doi":"10.1145/2628071.2628107","DOIUrl":"https://doi.org/10.1145/2628071.2628107","url":null,"abstract":"The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for parallel workloads. One of the main benefits of such architecture is its latency-hiding capability. However, the efficacy of GPU's latency-hiding varies significantly across GPGPU applications. To investigate this, this paper first proposes a new algorithm that profiles execution behavior of GPGPU applications. We characterize latencies caused by various pipeline hazards, memory accesses, synchronization primitives, and the warp scheduler. Our results show that the current round-robin warp scheduler works well in overlapping various latency stalls with the execution of other available warps for only a few GPGPU applications. For other applications, there is an excessive latency stall that cannot be hidden by the scheduler effectively. With the latency characterization insight, we observe a significant execution time disparity for warps within the same thread block, which causes suboptimal performance, called the warp criticality problem. To tackle the warp criticality problem, we design a family of criticality-aware warp scheduling (CAWS) policies by scheduling the critical warp(s) more frequently than other warps. Our results on the breadth-first-search, B+tree search, two point angular correlation function, and K-means clustering show that, with oracle knowledge of warp criticality, our best-performing scheduling policy can improve GPGPU applications' performance by 17% on average. With our designed criticality predictor, the various scheduling policies can improve performance by 10–21% on breadth-first-search. To our knowledge, this is the first paper to characterize warp criticality and explore different criticality-aware warp scheduling policies for GPGPU workloads.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130150630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trading cache hit rate for memory performance","authors":"W. Ding, M. Kandemir, D. Guttman, Adwait Jog, C. Das, Praveen Yedlapalli","doi":"10.1145/2628071.2628082","DOIUrl":"https://doi.org/10.1145/2628071.2628082","url":null,"abstract":"Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single compiler based approach that can improve row-buffer locality in executing irregular applications. This presents a critical problem considering the fact that executing irregular applications in a power and performance efficient manner will be a key requirement to extract maximum benefits from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance). The ultimate goal of this strategy is to find the right tradeoff point between cache performance and row-buffer performance so that the overall application performance is improved. Third, the paper performs a detailed evaluation of these two approaches using both an AMD Opteron based multicore system and a multicore simulator. The experimental results, collected using five real-world irregular applications, show that (i) conventional cache optimizations do not improve row-buffer locality significantly; (ii) our first approach achieves about 9.8% execution time improvement by keeping the number of cache misses the same as a cache-optimized code but reducing the number of row-buffer misses; and (iii) our second approach achieves even higher execution time improvements (13.8% on average) by sacrificing cache performance for additional memory performance.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121236639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consolidated conflict detection for hardware transactional memory","authors":"Lihang Zhao, J. Draper","doi":"10.1145/2628071.2628076","DOIUrl":"https://doi.org/10.1145/2628071.2628076","url":null,"abstract":"Hardware Transactional Memory (HTM) promises to ease multithreaded parallel programming with uncompromised performance. Microprocessors supporting HTM implement a conflict detection mechanism to detect data access conflicts between transactions. Understanding the on-chip network bandwidth utilization of such mechanisms is important as the energy and latency cost of routing packets across the chip is growing alarmingly. We investigate the communication characteristics of a typical conflict detection mechanism. A variety of traffic overheads are identified, which accounts for a combined 56% of the total transactional traffic in a wide spectrum of applications. To combat this problem, we propose C2D (Consolidated Conflict Detection), a novel micro-architectural technique to consolidate conflict detection to a logically central (but physically distributed) agent to reduce the bandwidth utilization of conflict detection. Full system evaluation shows that the proposed technique, if applied to conventional eager conflict detection, can reduce 35% of the traffic and hence 27% of the network energy. The consolidated eager conflict detection generates less traffic than a lazy conflict detection scheme thereby closing the gap between bandwidth utilization of eager and lazy conflict detection.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126879955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}