2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)最新文献

Improving support for locality and fine-grain sharing in chip multiprocessors 改进芯片多处理器对局部性和细粒度共享的支持

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454138

Hemayet Hossain, S. Dwarkadas, Michael C. Huang

{"title":"Improving support for locality and fine-grain sharing in chip multiprocessors","authors":"Hemayet Hossain, S. Dwarkadas, Michael C. Huang","doi":"10.1145/1454115.1454138","DOIUrl":"https://doi.org/10.1145/1454115.1454138","url":null,"abstract":"Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol is able to reduce the average access latency and increase the effective cache capacity at the L1 level with on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces the on-chip bandwidth requirements and dynamic energy (power) consumption by an average of 33.3% and 31.2% (20.2%) respectively. By evaluating optimizations at both the L1 and the L2 level, we demonstrate that when considering performance, optimization at the L1 level is more effective at supporting fine-grain sharing than that at the L2 level.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124815720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 31

Hybrid access-specific software cache techniques for the cell BE architecture 用于单元BE体系结构的混合特定于访问的软件缓存技术

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454156

Marc González, Nikola Vujic, X. Martorell, E. Ayguadé, A. Eichenberger, Tong Chen, Zehra Sura, Tao Zhang, K. O'Brien, Kathryn M. O'Brien

{"title":"Hybrid access-specific software cache techniques for the cell BE architecture","authors":"Marc González, Nikola Vujic, X. Martorell, E. Ayguadé, A. Eichenberger, Tong Chen, Zehra Sura, Tao Zhang, K. O'Brien, Kathryn M. O'Brien","doi":"10.1145/1454115.1454156","DOIUrl":"https://doi.org/10.1145/1454115.1454156","url":null,"abstract":"Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies at compile time memory accesses in two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that improvements due to the optimized software-cache structures combined with the proposed code-optimizations translate into 3.5 to 8.4 speedup factors, compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128267827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 52

Skewed redundancy 倾斜的冗余

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454126

Gordon B. Bell, Mikko H. Lipasti

{"title":"Skewed redundancy","authors":"Gordon B. Bell, Mikko H. Lipasti","doi":"10.1145/1454115.1454126","DOIUrl":"https://doi.org/10.1145/1454115.1454126","url":null,"abstract":"Technology scaling in integrated circuits has consistently provided dramatic performance improvements in modern microprocessors. However, increasing device counts and decreasing on-chip voltage levels have made transient errors a first-order design constraint that can no longer be ignored. Several proposals have provided fault detection and tolerance through redundantly executing a program on an additional hardware thread or core. While such techniques can provide high fault coverage, they at best provide equivalent performance to the original execution and at worst incur a slowdown due to error checking, contention for shared resources, and synchronization overheads. This work achieves a similar goal of detecting transient errors by redundantly executing a program on an additional processor core, however it speeds up (rather than slows down) program execution compared to the unprotected baseline case. It makes the observation that a small number of instructions are detrimental to overall performance, and selectively skipping them enables one core to advance far ahead of the other to obtain prefetching and large instruction window benefits. We highlight the modest incremental hardware required to support skewed redundancy and demonstrate a speedup of 6%/54% for a collection of integer/floating point benchmarks while still providing 100% error detection coverage within our sphere of replication. Additionally, we show that a third core can further improve performance while adding error recovery capabilities.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127620106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Prediction models for multi-dimensional power-performance optimization on many cores 多核多维功率性能优化预测模型

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454151

Matthew Curtis-Maury, Ankur Shah, F. Blagojevic, Dimitrios S. Nikolopoulos, B. Supinski, M. Schulz

{"title":"Prediction models for multi-dimensional power-performance optimization on many cores","authors":"Matthew Curtis-Maury, Ankur Shah, F. Blagojevic, Dimitrios S. Nikolopoulos, B. Supinski, M. Schulz","doi":"10.1145/1454115.1454151","DOIUrl":"https://doi.org/10.1145/1454115.1454151","url":null,"abstract":"Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance predictor, which we deploy to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems. We present results from an implementation of the predictor in a runtime library linked to the Intel OpenMP environment and running on an actual dual-processor quad-core system. We show that our predictor derives near-optimal settings of the power-aware program adaptation knobs that we consider. Our overall framework achieves significant reductions in energy (19% mean) and ED2 (40% mean), through simultaneous power savings (6% mean) and performance improvements (14% mean). We also find that our framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"413 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126692646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 190

Multi-Optimization power management for chip multiprocessors 芯片多处理器的多优化电源管理

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454141

Ke Meng, R. Joseph, R. Dick, L. Shang

{"title":"Multi-Optimization power management for chip multiprocessors","authors":"Ke Meng, R. Joseph, R. Dick, L. Shang","doi":"10.1145/1454115.1454141","DOIUrl":"https://doi.org/10.1145/1454115.1454141","url":null,"abstract":"The emergence of power as a first-class design constraint has fueled the proposal of a growing number of run-time power optimizations. Many of these optimizations trade-off power saving opportunity for a variable performance loss which depends on application characteristics and program phase. Furthermore, the potential benefits of these optimizations are sometimes non-additive, and it can be difficult to identify which combinations of these optimizations to apply. Trial-and-error approaches have been proposed to adaptively tune a processor. However, in a chip multiprocessor, the cost of individually configuring each core under a wide range of optimizations would be prohibitive under simple trial-and-error approaches. In this work, we introduce an adaptive, multi-optimization power saving strategy for multi-core power management. Specifically, we solve the problem of meeting a global chip-wide power budget through run-time adaptation of highly configurable processor cores. Our approach applies analytic modeling to reduce exploration time and decrease the reliance on trial-and-error methods. We also introduce risk evaluation to balance the benefit of various power saving optimizations versus the potential performance loss. Overall, we find that our approach can significantly reduce processor power consumption compared to alternative optimization strategies.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122229590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 111

Mars: A MapReduce Framework on graphics processors Mars:一个基于图形处理器的MapReduce框架

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454152

Bingsheng He, Wenbin Fang, Qiong Luo, N. Govindaraju, Tuyong Wang

引用次数: 829

A tuning framework for software-managed memory hierarchies 软件管理内存层次结构的调优框架

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454155

Manman Ren, Ji Young Park, M. Houston, A. Aiken, W. Dally

引用次数: 43

MCAMP: Communication optimization on Massively Parallel Machines with hierarchical scratch-pad memory 基于分级刮擦板存储器的大规模并行机器上的通信优化

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454132

H. Hayashizaki, Yutaka Sugawara, M. Inaba, K. Hiraki

{"title":"MCAMP: Communication optimization on Massively Parallel Machines with hierarchical scratch-pad memory","authors":"H. Hayashizaki, Yutaka Sugawara, M. Inaba, K. Hiraki","doi":"10.1145/1454115.1454132","DOIUrl":"https://doi.org/10.1145/1454115.1454132","url":null,"abstract":"Massively parallel machines that integrate a large number of simple processors and small scratch-pad memories (SPMs) into a single chip can achieve a high peak performance per watt of power. In these machines, communication optimizations are important because the communication bandwidth tends to be a bottleneck. Previously proposed communication optimizations using copy candidates, which have been shown to be effective, detect frequently reused array regions by compile-time analysis and copy the regions to scratch-pad memories nearer to the processors. However, they have been proposed for uniprocessor systems or small parallel machines with one or more layers of scratch-pad memories, and the analysis time increases when they are applied to massively parallel machines. In this paper, we propose Multilayer Copy-candidate Analysis for Massively Parallel machines (MCAMP), a communication optimization method for massively parallel machines. MCAMP re-formalizes the framework used in earlier works and improves the scalability of the analysis by assuming the homogeneity of the target systems. We implemented an MCAMP optimizer, which takes an input program that consists of perfectly nested loops containing array references and computation codes, and generates optimized communication. We measured the performance of the output programs of the MCAMP optimizer by executing them on a real massively parallel machine GRAPE-DR using a software tool chain that we also implemented. We showed that MCAMP can achieve optimal data transfer patterns and comparable performance to that of hand-optimized codes with a short analysis time.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115833236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Edge-centric modulo scheduling for coarse-grained reconfigurable architectures 面向粗粒度可重构架构的以边缘为中心的模调度

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454140

Hyunchul Park, Kevin Fan, S. Mahlke, Taewook Oh, Heeseok Kim, Hong-Seok Kim

{"title":"Edge-centric modulo scheduling for coarse-grained reconfigurable architectures","authors":"Hyunchul Park, Kevin Fan, S. Mahlke, Taewook Oh, Heeseok Kim, Hong-Seok Kim","doi":"10.1145/1454115.1454140","DOIUrl":"https://doi.org/10.1145/1454115.1454140","url":null,"abstract":"Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files often organized as a two dimensional grid. The most difficult challenge in deploying CGRAs is compiler scheduling technology that can efficiently map software implementations of compute intensive loops onto the array. Traditional schedulers focus on the placement of operations in time and space. With CGRAs, the challenge of placement is compounded by the need to explicitly route operands from producers to consumers. To systematically attack this problem, we take an edge-centric approach to modulo scheduling that focuses on the routing problem as its primary objective. With edge-centric modulo scheduling (EMS), placement is a by-product of the routing process, and the schedule is developed by routing each edge in the dataflow graph. Routing cost metrics provide the scheduler with a global perspective to guide selection. Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128211249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 195

Multitasking workload scheduling on flexible-core chip multiprocessors 柔性核芯片多处理器上的多任务工作负载调度

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI: 10.1145/1454115.1454142

Divya Gulati, Changkyu Kim, S. Sethumadhavan, S. Keckler, D. Burger

{"title":"Multitasking workload scheduling on flexible-core chip multiprocessors","authors":"Divya Gulati, Changkyu Kim, S. Sethumadhavan, S. Keckler, D. Burger","doi":"10.1145/1454115.1454142","DOIUrl":"https://doi.org/10.1145/1454115.1454142","url":null,"abstract":"While technology trends have ushered in the age of chip multiprocessors (CMP), a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs (SCMP) in which each core is identical and range from a simple RISC processor to a complex out-of-order x86 processor. Some researchers have proposed asymmetric CMPs (ACMP) consisting of multiple types of cores. While less of an issue for ACMPs, the fixed nature of both these architectures makes them vulnerable to mismatches between the granularity of the cores and the parallelism in the workload, which can cause inefficient execution. To remedy this weakness, recent research has proposed flexible-core CMPs (FCMP), which have the capability of aggregating multiple small processing cores to form larger logical processors. FCMPs introduce a new resource allocation and scheduling problem which must determine how many logical processors should be configured, how powerful each processor should be, and where/when each task should run. This paper introduces and motivates this problem, describes the challenges associated with it, and evaluates algorithms appropriate for multitasking on FCMPs. We also evaluate static-core CMPs of various configurations and compare them to FCMPs for various multitasking workloads.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131033034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30