{"title":"Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters","authors":"T. Agarwal, M. Becchi","doi":"10.1145/2628071.2671423","DOIUrl":"https://doi.org/10.1145/2628071.2671423","url":null,"abstract":"In the last few years, GPUs have become an integral part of HPC clusters. To test these heterogeneous CPU-GPU systems, we designed a hybrid CUDA-MPI benchmark suite that consists of three communication- and compute-intensive applications: Matrix Multiplication (MM), Needleman-Wunsch (NW) and the ADFA compression algorithm [1]. The main goal of this work is to characterize these workloads on CPU-GPU clusters. Our benchmark applications are designed to allow cluster administrators to identify bottlenecks in the cluster, to decide if scaling applications to multiple nodes would improve or decrease overall throughput and to design effective scheduling policies. Our experiments show that inter-node communication can significantly degrade the throughput of communication-intensive applications. We conclude that the scalability of the applications depends primarily on two factors: the cluster configuration and the applications characteristics.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131170302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based performance accounting for chip multiprocessor memory systems","authors":"Magnus Jahre","doi":"10.1145/2628071.2628111","DOIUrl":"https://doi.org/10.1145/2628071.2628111","url":null,"abstract":"Chip Multiprocessor (CMP) memory systems share memory system resources between processor cores. While this sharing enables good resource utilization and fast inter-processor communication, it also makes the performance of an application depend on its co-runners. This breaks the system software assumption that a process has the same rate of progress regardless of the co-schedule, potentially leading to priority inversion, missed deadlines, unpredictable interactive performance and non-compliance with service level agreements. In this work, we present a novel graph-based technique that accurately estimates the performance an application would experience without memory system interference. Dynamic interference-free performance estimates can enable scheduling algorithms and management policies that optimize directly for system performance metrics.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131856072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EFetch: Optimizing instruction fetch for event-driven web applications","authors":"Gaurav Chadha, S. Mahlke, S. Narayanasamy","doi":"10.1145/2628071.2628103","DOIUrl":"https://doi.org/10.1145/2628071.2628103","url":null,"abstract":"Web 2.0 applications written in JavaScript are increasingly popular as they are easy to use, easy to update and maintain, and portable across a wide variety of computing platforms. Web applications receive frequent input from a rich array of sensors, network, and user input modalities. To handle the resulting asynchrony due to these inputs, web applications are developed using an event-driven programming model. These event-driven web applications have dramatically different characteristics, which provides an opportunity to create a customized processor core to improve the responsiveness of web applications. In this paper, we take one step towards creating a core customized to event-driven applications. We observe that instruction cache misses of web applications are substantially higher than conventional server and desktop workloads due to large working sets caused by distant re-use. To mitigate this bottleneck, we propose an instruction prefetcher (EFetch) that is tuned to exploit the characteristics of web applications. We find that an event signature, which captures the current event and function calling context, is a good predictor of the control flow inside a function of an event-driven program. It allows us to accurately predict a function's callees and their function bodies and prefetch them in a timely manner. For a set of real-world web applications, we show that the proposed prefetcher outperforms commonly implemented next-2-line prefetcher by 17%. Also, it consumes 5.2 times less area than a recently proposed prefetcher, while outperforming it.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124278057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From petascale to the pocket: Adaptively scaling parallel programs for mobile SoCs","authors":"Adam Fidel, N. Amato, Lawrence Rauchwerger","doi":"10.1145/2628071.2671426","DOIUrl":"https://doi.org/10.1145/2628071.2671426","url":null,"abstract":"With resource-constrained mobile and embedded devices being outfitted with multicore processors, there exists a need to allow existing parallel programs to be scaled down to efficiently utilize these devices. We study the marriage of programming models originally designed for distributed-memory supercomputers with smaller scale parallel architectures that are shared-memory and generally resource-constrained.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124895258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeSTM: Harnessing determinism in STMs for application development","authors":"K. Ravichandran, Ada Gavrilovska, S. Pande","doi":"10.1145/2628071.2628094","DOIUrl":"https://doi.org/10.1145/2628071.2628094","url":null,"abstract":"Non-determinism has long been recognized as one of the key challenges which restrict parallel programmer productivity by complicating several phases of application development. While Software Transactional Memory (STM) systems have greatly improved the productivity of programmers developing parallel applications in a host of areas they still exhibit non-deterministic behavior leading to decreased productivity. While determinism in parallel applications which use traditional synchronization primitives (such as locks) has been relatively well studied, its interplay with STMs has not. In this paper we present DeSTM, a deterministic STM, which allows programmers to leverage determinism through the implementation, debugging and testing phases of application development. In this work we first adapt techniques which introduce determinism in applications which use traditional synchronization (such as locks) to work in conjunction with certain STMs. As one would expect, this does lead to performance degradation over a non-deterministic execution. Next we present, DeSTM, which uses novel techniques exploiting the properties of these STMs to dramatically improve the performance of deterministic executions. Further, DeSTM allows programmers to randomly change the deterministic schedule in a controlled fashion giving programmers access to a wide variety of execution schedules during application development. We demonstrate our approach on the STAMP benchmark suite. We first study the overheads that determinism introduces in STM applications and then demonstrate how DeSTM is able to improve performance of deterministic execution significantly, by over 50% in some applications and on average by about 35%. DeSTM also actually helped us detect, what we believe is a bug, in one of the benchmarks. Further, our approach is programmer friendly requiring no changes to application code.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"35 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126057919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using STT-RAM to enable energy-efficient near-threshold chip multiprocessors","authors":"Xiang Pan, R. Teodorescu","doi":"10.1145/2628071.2628132","DOIUrl":"https://doi.org/10.1145/2628071.2628132","url":null,"abstract":"Near-threshold computing is gaining traction as an energy-efficient solution for power-constrained systems. This paper proposes a novel near-threshold chip multiprocessor design that uses non-volatile spin-transfer torque random access memory (STT-RAM) technology to implement all on-chip caches. This technology has several advantages over SRAM that are particularly useful in near-threshold designs. Primarily, STT-RAM has very low leakage, saving a substantial fraction of the power consumed by near-threshold chips. In addition, the STT-RAM components run at a higher supply voltage to speed up write operations. This has the effect of making cache reads very fast to the point where L1 caches can be shared by several cores, improving performance. Overall, the proposed design saves 11–33% energy compared to an SRAM-based near-threshold system.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125327993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive heterogeneous scheduling for integrated GPUs","authors":"R. Kaleem, R. Barik, T. Shpeisman, B. Lewis, Chunling Hu, K. Pingali","doi":"10.1145/2628071.2628088","DOIUrl":"https://doi.org/10.1145/2628071.2628088","url":null,"abstract":"Many processors today integrate a CPU and GPU on the same die, which allows them to share resources like physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naïve and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of dataparallel kernels between the CPU and GPU without input from application developers. It does profiling on the CPU and GPU in a way that doesn't penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance via irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work on each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches primarily targeting NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing. We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th Generation Core Processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput with a CPU-and-GPU oracle that always chooses the best work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114388419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAWS: Criticality-aware warp scheduling for GPGPU workloads","authors":"Shin-Ying Lee, Carole-Jean Wu","doi":"10.1145/2628071.2628107","DOIUrl":"https://doi.org/10.1145/2628071.2628107","url":null,"abstract":"The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for parallel workloads. One of the main benefits of such architecture is its latency-hiding capability. However, the efficacy of GPU's latency-hiding varies significantly across GPGPU applications. To investigate this, this paper first proposes a new algorithm that profiles execution behavior of GPGPU applications. We characterize latencies caused by various pipeline hazards, memory accesses, synchronization primitives, and the warp scheduler. Our results show that the current round-robin warp scheduler works well in overlapping various latency stalls with the execution of other available warps for only a few GPGPU applications. For other applications, there is an excessive latency stall that cannot be hidden by the scheduler effectively. With the latency characterization insight, we observe a significant execution time disparity for warps within the same thread block, which causes suboptimal performance, called the warp criticality problem. To tackle the warp criticality problem, we design a family of criticality-aware warp scheduling (CAWS) policies by scheduling the critical warp(s) more frequently than other warps. Our results on the breadth-first-search, B+tree search, two point angular correlation function, and K-means clustering show that, with oracle knowledge of warp criticality, our best-performing scheduling policy can improve GPGPU applications' performance by 17% on average. With our designed criticality predictor, the various scheduling policies can improve performance by 10–21% on breadth-first-search. To our knowledge, this is the first paper to characterize warp criticality and explore different criticality-aware warp scheduling policies for GPGPU workloads.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130150630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trading cache hit rate for memory performance","authors":"W. Ding, M. Kandemir, D. Guttman, Adwait Jog, C. Das, Praveen Yedlapalli","doi":"10.1145/2628071.2628082","DOIUrl":"https://doi.org/10.1145/2628071.2628082","url":null,"abstract":"Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single compiler based approach that can improve row-buffer locality in executing irregular applications. This presents a critical problem considering the fact that executing irregular applications in a power and performance efficient manner will be a key requirement to extract maximum benefits from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance). The ultimate goal of this strategy is to find the right tradeoff point between cache performance and row-buffer performance so that the overall application performance is improved. Third, the paper performs a detailed evaluation of these two approaches using both an AMD Opteron based multicore system and a multicore simulator. The experimental results, collected using five real-world irregular applications, show that (i) conventional cache optimizations do not improve row-buffer locality significantly; (ii) our first approach achieves about 9.8% execution time improvement by keeping the number of cache misses the same as a cache-optimized code but reducing the number of row-buffer misses; and (iii) our second approach achieves even higher execution time improvements (13.8% on average) by sacrificing cache performance for additional memory performance.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121236639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consolidated conflict detection for hardware transactional memory","authors":"Lihang Zhao, J. Draper","doi":"10.1145/2628071.2628076","DOIUrl":"https://doi.org/10.1145/2628071.2628076","url":null,"abstract":"Hardware Transactional Memory (HTM) promises to ease multithreaded parallel programming with uncompromised performance. Microprocessors supporting HTM implement a conflict detection mechanism to detect data access conflicts between transactions. Understanding the on-chip network bandwidth utilization of such mechanisms is important as the energy and latency cost of routing packets across the chip is growing alarmingly. We investigate the communication characteristics of a typical conflict detection mechanism. A variety of traffic overheads are identified, which accounts for a combined 56% of the total transactional traffic in a wide spectrum of applications. To combat this problem, we propose C2D (Consolidated Conflict Detection), a novel micro-architectural technique to consolidate conflict detection to a logically central (but physically distributed) agent to reduce the bandwidth utilization of conflict detection. Full system evaluation shows that the proposed technique, if applied to conventional eager conflict detection, can reduce 35% of the traffic and hence 27% of the network energy. The consolidated eager conflict detection generates less traffic than a lazy conflict detection scheme thereby closing the gap between bandwidth utilization of eager and lazy conflict detection.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126879955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}