2011 International Conference on Parallel Architectures and Compilation Techniques最新文献_第2页

Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs 相干轮廓:为基于环路的并行程序实现多核缩放的有效重用距离分析

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1145/2427631.2427632

Meng-Ju Wu, D. Yeung

{"title":"Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs","authors":"Meng-Ju Wu, D. Yeung","doi":"10.1145/2427631.2427632","DOIUrl":"https://doi.org/10.1145/2427631.2427632","url":null,"abstract":"Reuse distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle though is multicore RD analysis requires measuring concurrent reuse distance (CRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD profiles shift coherently to larger CRD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD profiles feasible. Given the ubiquity and scalability of loop-level parallelism, such techniques will be extremely valuable for studying future large multicore designs. This paper investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide in depth analysis on how CRD profiles change with core count scaling. Second, we develop techniques to predict CRD profile scaling, in particular employing reference groups to predict coherent shift, and evaluate prediction accuracy. Third, we show core count scaling only degrades performance for last level caches (LLCs) below 16MB for our benchmarks and problem sizes, increasing to 64 -- 128MB if problem size scales by 64x. Finally, we apply CRD profiles to analyze multicore cache performance. When combined with existing problem scaling prediction, our techniques can predict LLC MPKI to within 11.1% of simulation across 1,728 configurations using only 36 measured CRD profiles.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124946475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 62

Making STMs Cache Friendly with Compiler Transformations 通过编译器转换使stm缓存友好

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.55

Sandya Mannarswamy, R. Govindarajan

{"title":"Making STMs Cache Friendly with Compiler Transformations","authors":"Sandya Mannarswamy, R. Govindarajan","doi":"10.1109/PACT.2011.55","DOIUrl":"https://doi.org/10.1109/PACT.2011.55","url":null,"abstract":"Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STM's fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132529241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Speculative Parallelization in Decoupled Look-ahead 解耦预见性并行化

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.72

Alok Garg, Raj Parihar, Michael C. Huang

{"title":"Speculative Parallelization in Decoupled Look-ahead","authors":"Alok Garg, Raj Parihar, Michael C. Huang","doi":"10.1109/PACT.2011.72","DOIUrl":"https://doi.org/10.1109/PACT.2011.72","url":null,"abstract":"While a canonical out-of-order engine can effectively exploit implicit parallelism in sequential programs, its effectiveness is often hindered by instruction and data supply imperfections manifested as branch mispredictions and cache misses. Accurate and deep look-ahead guided by a slice of the executed program is a simple yet effective approach to mitigate the performance impact of branch mispredictions and cache misses. Unfortunately, program slice-guided look ahead is often limited by the speed of the look-ahead code slice, especially for irregular programs. In this paper, we attempt to speed up the look-ahead agent using speculative parallelization, which is especially suited for the task. First, slicing for look-ahead tends to reduce important data dependences that prohibit successful speculative parallelization. Second, the task for look-ahead is not correctness critical and thus naturally tolerates dependence violations. This enables an implementation to forgo violation detection altogether, simplifying architectural support tremendously. In a straightforward implementation, incorporating speculative parallelization to the look-ahead agent further improves system performance by up to 1.39x with an average of 1.13x.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128278988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13

rPRAM: Exploring Redundancy Techniques to Improve Lifetime of PCM-based Main Memory 探索冗余技术以提高基于pcm的主存的寿命

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.40

Jie Chen, Zachary Winter, Guru Venkataramani, H. H. Huang

引用次数: 7

CriticalFault: Amplifying Soft Error Effect Using Vulnerability-Driven Injection CriticalFault:使用漏洞驱动注入放大软错误效果

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.25

Xin Xu, Man-Lap Li

引用次数: 2

StVEC: A Vector Instruction Extension for High Performance Stencil Computation StVEC:高性能模板计算的矢量指令扩展

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.59

N. Sedaghati, Renji Thomas, L. Pouchet, R. Teodorescu, P. Sadayappan

{"title":"StVEC: A Vector Instruction Extension for High Performance Stencil Computation","authors":"N. Sedaghati, Renji Thomas, L. Pouchet, R. Teodorescu, P. Sadayappan","doi":"10.1109/PACT.2011.59","DOIUrl":"https://doi.org/10.1109/PACT.2011.59","url":null,"abstract":"Stencil computations comprise the compute-intensive core of many scientific applications. The data access pattern of stencil computations often requires several adjacent data elements of arrays to be accessed in innermost parallel loops. Although such loops are vectorized by current compilers like GCC and ICC that target short-vector SIMD instruction sets, a number of redundant loads or additional intra-register data shuffle operations are required, reducing the achievable performance. Thus, even when all arrays are cache resident, the peak performance achieved with stencil computations is considerably lower than machine peak. In this paper, we present a hardware-based solution for this problem. We propose an extension to the standard addressing mode of vector floating-point instructions in ISAs such as SSE, AVX, VMX etc. We propose an extended mode of paired-register addressing and its hardware implementation, to overcome the performance limitation of current short-vector SIMD ISA's for stencil computations. Further, we present a code generation approach that can be used by a vectorizing compiler for processors with such an instructions set. Using an optimistic as well as a pessimistic emulation of the proposed instruction extension, we demonstrate the effectiveness of the proposed approach on top of SSE and AVX capable processors. We also synthesize parts of the proposed design using a 45nm CMOS library and show minimal impact on processor cycle time.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121488385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 15

A Compiler-assisted Runtime-prefetching Scheme for Heterogenous Platforms 面向异构平台的编译器辅助运行时预取方案

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1007/978-3-642-30961-8_9

Li Chen, B. Shou, Xionghui Hou, Lei Huang

引用次数: 5

Optimizing Data Layouts for Parallel Computation on Multicores 优化多核并行计算的数据布局

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.20

Yuanrui Zhang, W. Ding, Jun Liu, M. Kandemir

引用次数: 25

Exploiting Rank Idle Time for Scheduling Last-Level Cache Writeback 利用Rank空闲时间调度最后一级缓存回写

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.43

Zhe Wang, Daniel A. Jiménez

引用次数: 1

Sampling Temporal Touch Hint (STTH) Inclusive Cache Management Policy 采样时间触摸提示(STTH)包含缓存管理策略

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI: 10.1109/PACT.2011.42

Yingying Tian, Daniel A. Jiménez

引用次数: 1