2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Papers

A Hierarchical Approach to Maximizing MapReduce Efficiency

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.22

Zhiwei Xiao, Haibo Chen, B. Zang

Abstract: MapReduce has been widely recognized for its elastic scalability and fault tolerance, while its efficiency has been relatively neglected, even though efficiency is equally important in "pay-as-you-go" cloud systems such as Amazon's Elastic MapReduce. This paper argues that typical multicore clusters have multiple levels of data locality and parallelism that affect performance. By characterizing the performance limitations of typical MapReduce applications on multicore-based Hadoop clusters, we show that the current JVM-based runtime (i.e., Task Worker) fails to exploit data locality and task parallelism at the single-node level. Based on this study, we extend Hadoop with a hierarchical MapReduce model and seamlessly integrate an efficient multicore MapReduce runtime into Hadoop, resulting in a system we call Azwraith. This hierarchical scheme enables MapReduce applications to exploit locality and parallelism at both the cluster level and the single-node level. To reuse data across job boundaries, we also extend Azwraith with an effective in-memory cache scheme that significantly reduces network and disk traffic. Performance evaluation on a small-scale cluster shows that Azwraith, combined with these optimizations, outperforms Hadoop by 1.4x to 3.5x.

Citations: 6
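
The two-level idea in the abstract above (merge results inside each node first, then across the cluster) can be illustrated with a small word count. This is only a sketch of the general hierarchical pattern, not Azwraith's or Hadoop's implementation; the function names and the two-node layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def map_words(chunk: str) -> Counter:
    """Map step: count the words in one input chunk."""
    return Counter(chunk.split())


def node_level_reduce(chunks: Iterable[str]) -> Counter:
    """Map every chunk held by one node and merge the counts locally.

    Merging on the node first means only one partial result per node
    would have to cross the network.
    """
    partial: Counter = Counter()
    for chunk in chunks:
        partial.update(map_words(chunk))
    return partial


def cluster_level_reduce(partials: Iterable[Counter]) -> Dict[str, int]:
    """Merge the per-node partial results into the final answer."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical layout: two nodes, each holding two local chunks.
cluster: List[List[str]] = [
    ["a b a", "b c"],
    ["c c a", "d"],
]
result = cluster_level_reduce(node_level_reduce(node) for node in cluster)
print(result)
```

In this toy setup each node sends a single merged counter to the cluster-level step instead of one counter per chunk, which is the locality benefit the abstract alludes to.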

Building Retargetable and Efficient Compilers for Multimedia Instruction Sets

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.23

S. Guelton, A. Guinet, R. Keryell

Citations: 5

Beforehand Migration on D-NUCA Caches

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.38

Javier Lira, Timothy M. Jones, Carlos Molina, Antonio González

Citations: 2

Linear-time Modeling of Program Working Set in Shared Cache

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.66

Xiaoya Xiang, Bin Bao, C. Ding, Yaoqing Gao

Citations: 67

Divergence Analysis and Optimizations

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.63

Bruno Coutinho, Diogo Sampaio, Fernando Magno Quintão Pereira, Wagner Meira Jr

Abstract: The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step and may lose performance due to divergences caused by conditional branches. In the face of divergences, some PEs execute while others wait, and this alternation ends when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers manually improve their SIMD applications, and it guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and Nvidia's SDK. Our divergence analysis has a 34% false-positive rate compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up to parallel quick sort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.

Citations: 104
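
The core question of a divergence analysis (which variables may hold different values on different PEs) can be approximated by propagating a "divergent" mark from the thread ID through data dependences until a fixed point is reached. The sketch below is a deliberately simplified illustration, not the paper's algorithm or Ocelot's code: it ignores control dependences, which a real analysis must also track, and the toy program and all names in it are hypothetical.

```python
from typing import Dict, List, Set


def divergent_variables(defs: Dict[str, List[str]], seeds: Set[str]) -> Set[str]:
    """Return every variable that transitively depends on a seed.

    defs maps each defined variable to the operands of its definition.
    seeds are variables known to differ across PEs (e.g. the thread ID).
    Only data dependences are followed in this sketch.
    """
    divergent = set(seeds)
    changed = True
    while changed:  # iterate until no new variable gets marked
        changed = False
        for dest, operands in defs.items():
            if dest not in divergent and any(op in divergent for op in operands):
                divergent.add(dest)
                changed = True
    return divergent


# Hypothetical kernel fragment:
#   i     = tid + block_size
#   n     = len
#   addr  = base + i
#   limit = n * block_size
program = {
    "i": ["tid", "block_size"],
    "n": ["len"],
    "addr": ["base", "i"],
    "limit": ["n", "block_size"],
}
result = divergent_variables(program, seeds={"tid"})
print(sorted(result))
```

Variables left unmarked (here `n` and `limit`) are uniform across PEs under this simplified model, so a branch that tests only them cannot cause divergence.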

TIDeFlow: A Parallel Execution Model for High Performance Computing Programs

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.44

Daniel A. Orozco

Citations: 7

A Heterogeneous Parallel Framework for Domain-Specific Languages

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.15

Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, K. Olukotun

Abstract: Computing systems are becoming increasingly parallel and heterogeneous, and therefore new applications must be capable of exploiting parallelism in order to continue achieving high performance. However, targeting these emerging devices often requires using multiple disparate programming models and making decisions that can limit forward scalability. In previous work we proposed the use of domain-specific languages (DSLs) to provide high-level abstractions that enable transformations to high-performance parallel code without degrading programmer productivity. In this paper we present a new end-to-end system for building, compiling, and executing DSL applications on parallel heterogeneous hardware: the Delite Compiler Framework and Runtime. The framework lifts embedded DSL applications to an intermediate representation (IR), performs generic, parallel, and domain-specific optimizations, and generates an execution graph that targets multiple heterogeneous hardware devices. Finally, we present results comparing the performance of several machine learning applications written in OptiML, a DSL for machine learning that utilizes Delite, to C++ and MATLAB implementations. We find that the implicitly parallel OptiML applications achieve single-threaded performance comparable to C++ and outperform explicitly parallel MATLAB in nearly all cases.

Citations: 201

Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.17

Jungseob Lee, V. Sathish, M. Schulte, Katherine Compton, N. Kim

Abstract: State-of-the-art graphics processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of integrated cores. In general, the peak throughput of a GPU is proportional to the product of the number of cores and their frequency. However, this product is often limited by a power constraint. Although the throughput can be increased with more cores for some applications, it cannot for others because the parallelism of applications and/or the bandwidth of on-chip interconnects/caches and off-chip memory are limited. In this paper, first, we demonstrate that adjusting the number of operating cores and the voltage/frequency of cores and/or on-chip interconnects/caches for different applications can improve the throughput of GPUs under a power constraint. Second, we show that dynamically scaling the number of operating cores and the voltages/frequencies of both cores and on-chip interconnects/caches at runtime can improve the throughput of applications even further. Our experimental results show that a GPU adopting our runtime dynamic voltage/frequency and core scaling technique can provide up to 38% (and nearly 20% on average) higher throughput than the baseline GPU under the same power constraint.

Citations: 73
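
The trade-off stated in the abstract above (throughput proportional to cores times frequency, capped by a power budget and by the application's available parallelism) can be made concrete with a toy configuration search. Everything numeric here is an assumption for illustration only: the cubic power model (voltage assumed to scale with frequency), the unit coefficients, and the candidate core counts and frequencies are not taken from the paper.

```python
from itertools import product
from typing import Iterable, Tuple


def power(cores: int, freq_ghz: float) -> float:
    """Assumed model: per-core dynamic power grows with the cube of frequency."""
    return cores * freq_ghz ** 3


def throughput(cores: int, freq_ghz: float, max_parallelism: int) -> float:
    """Cores beyond the application's parallelism contribute nothing."""
    return min(cores, max_parallelism) * freq_ghz


def best_config(core_options: Iterable[int], freq_options: Iterable[float],
                budget: float, max_parallelism: int) -> Tuple[int, float]:
    """Pick the feasible (cores, frequency) pair with the highest throughput.

    Ties are broken in favour of the lower-power configuration.
    """
    feasible = [(c, f) for c, f in product(core_options, freq_options)
                if power(c, f) <= budget]
    if not feasible:
        raise ValueError("no configuration fits the power budget")
    return max(feasible,
               key=lambda cf: (throughput(cf[0], cf[1], max_parallelism),
                               -power(cf[0], cf[1])))


cores = [8, 16, 32]
freqs = [0.5, 1.0, 1.5]
wide_app = best_config(cores, freqs, budget=32.0, max_parallelism=64)
narrow_app = best_config(cores, freqs, budget=32.0, max_parallelism=8)
print(wide_app, narrow_app)
```

Under these assumed numbers the highly parallel application prefers many slow cores while the application with limited parallelism prefers few fast ones, which mirrors the abstract's point that the best operating point depends on the application.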

DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.21

Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, N. Honarmand, S. Adve, Vikram S. Adve, N. Carter, Ching-Tsun Chou

Abstract: For parallelism to become tractable for mass programmers, shared-memory languages and environments must evolve to enforce disciplined practices that ban "wild shared-memory behaviors," e.g., unstructured parallelism, arbitrary data races, and ubiquitous non-determinism. This software evolution is a rare opportunity for hardware designers to rethink hardware from the ground up to exploit opportunities exposed by such disciplined software models. Such a co-designed effort is more likely to achieve many-core scalability than a software-oblivious hardware evolution. This paper presents DeNovo, a hardware architecture motivated by these observations. We show how a disciplined parallel programming model greatly simplifies cache coherence and consistency while enabling a more efficient communication and cache architecture. The DeNovo coherence protocol is simple because it eliminates transient states; verification using model checking shows 15X fewer reachable states than a state-of-the-art implementation of the conventional MESI protocol. The DeNovo protocol is also more extensible. Adding two sophisticated optimizations, flexible communication granularity and direct cache-to-cache transfers, did not introduce additional protocol states (unlike MESI). Finally, DeNovo shows better cache hit rates and network traffic, translating to better performance and energy. Overall, a disciplined shared-memory programming model allows DeNovo to seamlessly integrate message passing-like interactions within a global address space for improved design complexity, performance, and efficiency.

Citations: 173

Dynamic Fine-Grain Scheduling of Pipeline Parallelism

Pub Date: 2011-10-10 DOI: 10.1109/PACT.2011.9

Daniel Sánchez, David Lo, Richard M. Yoo, J. Sugerman, C. Kozyrakis

Abstract: Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or the underlying hardware causes load imbalance, hindering performance. On the other hand, existing schemes for dynamic fine-grain load balancing (such as task-stealing) do not work well on pipeline-parallel programs: they cannot guarantee memory footprint bounds, and do not adequately schedule complex graphs or graphs with ordered queues. We present a scheduler implementation for pipeline-parallel programs that performs fine-grain dynamic load balancing efficiently. Specifically, we implement the first real runtime for GRAMPS, a recently proposed programming model that focuses on supporting irregular pipeline and data-parallel applications (in contrast to classical stream programming models and schedulers, which require programs to be regular). Task-stealing with per-stage queues and queuing policies, coupled with a backpressure mechanism, allows us to maintain strict footprint bounds, and a buffer management scheme based on packet-stealing allows low-overhead and locality-aware dynamic allocation of queue data. We evaluate our runtime on a multi-core SMP and find that it provides low-overhead scheduling of irregular workloads while maintaining locality. We also show that the GRAMPS scheduler outperforms several other commonly used scheduling approaches. Specifically, a typical task-stealing scheduler performs on par with GRAMPS on simple graphs but does significantly worse on complex ones; a canonical GPGPU scheduler cannot exploit pipeline parallelism and suffers from large memory footprints; and a typical static streaming scheduler achieves somewhat better locality but suffers significant load imbalance on a general-purpose multi-core due to fine-grain architecture variability (e.g., cache misses and SMT).

Citations: 61
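
The backpressure idea mentioned in the abstract above (a full queue stalls its producer, which bounds the memory held in flight) can be shown with bounded queues from Python's standard library. This is a minimal sketch of backpressure alone, not the GRAMPS runtime: it has no task-stealing, no packet-stealing, and one thread per stage, and the stage functions and sentinel convention are hypothetical.

```python
import queue
import threading
from typing import List

SENTINEL = None  # marks the end of the stream


def producer(out_q: queue.Queue, n: int) -> None:
    """First stage: emit 0..n-1, then the end-of-stream marker."""
    for i in range(n):
        out_q.put(i)  # blocks while the queue is full: this is the backpressure
    out_q.put(SENTINEL)


def square_stage(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Middle stage: square each item and forward it downstream."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put(item * item)


def run_pipeline(n: int, capacity: int) -> List[int]:
    """Run producer -> square_stage -> collector with bounded queues.

    Each queue holds at most `capacity` items, so the data in flight
    stays bounded no matter how large n is.
    """
    q1: queue.Queue = queue.Queue(maxsize=capacity)
    q2: queue.Queue = queue.Queue(maxsize=capacity)
    threads = [
        threading.Thread(target=producer, args=(q1, n)),
        threading.Thread(target=square_stage, args=(q1, q2)),
    ]
    for t in threads:
        t.start()
    results: List[int] = []
    while True:  # the calling thread acts as the final stage
        item = q2.get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


squares = run_pipeline(n=10, capacity=2)
print(squares)
```

With a single thread per stage the queues also preserve item order; a dynamic scheduler that runs several workers per stage would need an explicit mechanism for ordered queues, which is one of the difficulties the abstract points out.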