2015 International Conference on Parallel Architecture and Compilation (PACT): Latest Papers, Page 2

Using Hybrid Schedules to Safely Outperform Classical Polyhedral Schedules

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.52

Ti Jin

Abstract: The Polyhedral model is a mathematical framework for programs with affine control loops that enables complex program transformations such as loop permutation and loop tiling to achieve parallelism, data locality and energy efficiency. Polyhedral schedules are widely used by popular polyhedral compilers such as AlphaZ and PLuTo to represent program execution orders. They use barriers to enforce the correct order of execution, and synchronization usually happens more often than necessary. Current research reveals the merit of combining classical polyhedral schedules with partially ordered schedules written by hand using highly target-dependent point-wise synchronization mechanisms. However, derivation of a hybrid schedule is tedious and error-prone due to the possibility of deadlocks. Its deviation from any existing standard representation makes program verification the sole responsibility of the programmer. We propose techniques to automate the derivation, verification and code generation of hybrid schedules. We also demonstrate the convenience and utility of such techniques in resolving the complications associated with current hybrid schedules.

Citations: 0

Integrating 3D Resistive Memory Cache into GPGPU for Energy-Efficient Data Processing

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.60

Jie Zhang, D. Donofrio, J. Shalf, Myoungsoo Jung

Abstract: General-purpose graphics processing units (GPUs) have become a promising solution for processing massive data by taking advantage of multithreading. Thanks to thread-level parallelism, GPU-accelerated applications improve overall system performance by up to 40 times compared to a CPU-only architecture. However, data-intensive GPU applications often generate a large number of irregular data accesses, which results in cache thrashing and contention problems. The cache thrashing in turn can introduce a large number of off-chip memory accesses, which not only wastes tremendous energy moving data between the on-chip cache and off-chip global memory, but also significantly limits system performance due to many stalled load/store instructions. In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: one employs a 2D planar resistive random-access memory (RRAM) as our baseline NVM-cache, and the other employs a 3D-stacked RRAM technology. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area; a memory die consists of eight subarrays, each of which is a small memristor island constructed as a 512x512 matrix. Since the feature size of SRAM is around 125 F² (while that of RRAM is around 4 F²), it can offer around 30x more storage capacity than the SRAM-based cache. To make our baseline NVM-cache denser, we propose a 3D-stacked NVM-cache, which stacks four memory layers, each of which has a single pre-decode logic.
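The "around 30x" capacity figure in the abstract above can be checked from the two cell sizes it quotes. The following is only a back-of-the-envelope sketch: it assumes both caches occupy the same die area and ignores peripheral circuitry, neither of which the abstract states explicitly. The symbols A_SRAM and A_RRAM are introduced here for illustration.

```latex
\[
\frac{A_{\text{SRAM}}}{A_{\text{RRAM}}}
  \approx \frac{125\,F^{2}}{4\,F^{2}}
  = 31.25
  \approx 30\times
\]
```

Under these assumptions, an RRAM array fits roughly 31 cells in the area of one SRAM cell, which is consistent with the rounded 30x figure.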

Citations: 1

Exploiting Program Semantics to Place Data in Hybrid Memory

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.10

Wei Wei, D. Jiang, S. Mckee, Jin Xiong, Mingyu Chen

Abstract: Large-memory applications like data analytics and graph processing benefit from extended memory hierarchies, and hybrid DRAM/NVM (non-volatile memory) systems represent an attractive means by which to increase capacity at reasonable performance/energy tradeoffs. Compared to DRAM, NVMs generally have longer latencies and higher energies for writes, which makes careful data placement essential for efficient system operation. Data placement strategies that resort to monitoring all data accesses and migrating objects to dynamically adjust data locations incur high monitoring overhead and unnecessary memory copies due to mispredicted migrations. We find that program semantics (specifically, global access characteristics) can effectively guide initial data placement with respect to memory types, which, in turn, makes run-time migration more efficient. We study a combined offline/online placement scheme that uses access profiling information to place objects statically and then selectively monitors run-time behaviors to optimize placements dynamically. We present a software/hardware cooperative framework, 2PP, and evaluate it with respect to state-of-the-art migratory placement, finding that it improves performance by an average of 12.1%. Furthermore, 2PP improves energy efficiency by up to 51.8%, and by an average of 18.4%. It does so by reducing run-time monitoring and migration overheads.

Citations: 44

PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.17

Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, T. Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, A. Betts, A. Donaldson, J. Ketema, J. Absar, S. V. Haastregt, Alexey Kravets, Anton Lokhmotov, R. David, Elnar Hajiyev

Abstract: Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.

Citations: 100

Throttling Automatic Vectorization: When Less is More

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.32

Vasileios Porpodas, Timothy M. Jones

Abstract: SIMD vectors are widely adopted in modern general purpose processors as they can boost performance and energy efficiency for certain applications. Compiler-based automatic vectorization is one approach for generating code that makes efficient use of the SIMD units, and has the benefit of avoiding hand development and platform-specific optimizations. The Superword-Level Parallelism (SLP) vectorization algorithm is the most well-known implementation of automatic vectorization when starting from straight-line scalar code, and is implemented in several major compilers. The existing SLP algorithm greedily packs scalar instructions into vectors starting from stores and traversing the data dependence graph upwards until it reaches loads or non-vectorizable instructions. Choosing whether to vectorize is a one-off decision for the whole graph that has been generated. This, however, is sub-optimal because the graph may contain code that is harmful to vectorization due to the need to move data from scalar registers into vectors. The decision does not consider the potential benefits of throttling the graph by removing this harmful code. In this work we propose a solution to overcome this limitation by introducing Throttled SLP (TSLP), a novel vectorization algorithm that finds the optimal graph to vectorize, forcing vectorization to stop earlier whenever this is beneficial. Our experiments show that TSLP improves performance across a number of kernels extracted from widely-used benchmark suites, decreasing execution time compared to SLP by 9% on average and up to 14% in the best case.
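The contrast the abstract draws between an all-or-nothing decision and a throttled one can be illustrated with a toy cost model. This is a minimal sketch, not the TSLP algorithm from the paper: it assumes the SLP graph is a simple linear chain walked upward from the store, and the function names, the `savings` list, and the `gather_costs` list are all hypothetical. `savings[i]` is the cycles saved by vectorizing node i (negative if vectorizing it is harmful), and `gather_costs[d]` is the assumed cost of packing scalar values into vectors when vectorization stops at depth d.

```python
def slp_benefit(savings, gather_costs):
    """All-or-nothing decision: vectorize the whole chain or nothing.

    Returns the benefit actually obtained (0 if the chain is rejected).
    """
    depth = len(savings)
    total = sum(savings) - gather_costs[depth]
    return max(total, 0)


def throttled_best_cut(savings, gather_costs):
    """Try every cut point and keep the most profitable prefix.

    Returns (depth, benefit); depth 0 means "do not vectorize".
    """
    best_depth, best_benefit = 0, 0
    running = 0
    for depth, saving in enumerate(savings, start=1):
        running += saving
        benefit = running - gather_costs[depth]
        if benefit > best_benefit:
            best_depth, best_benefit = depth, benefit
    return best_depth, best_benefit


# Hypothetical chain: two profitable nodes followed by two harmful ones.
savings = [3, 2, -4, -3]
gather_costs = [0, 1, 1, 1, 0]  # index = depth at which vectorization stops

whole_graph = slp_benefit(savings, gather_costs)
cut = throttled_best_cut(savings, gather_costs)
```

With these made-up numbers the whole chain is unprofitable and would be rejected, while stopping after the second node still yields a positive benefit. A real SLP graph is a DAG rather than a chain, so an actual implementation would need to consider subgraphs rather than prefixes.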

Citations: 33

Towards General-Purpose Neural Network Computing

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.21

Schuyler Eldridge, Amos Waterland, M. Seltzer, J. Appavoo, A. Joshi

Abstract: Machine learning is becoming pervasive; decades of research in neural network computation are now being leveraged to learn patterns in data and perform computations that are difficult to express using standard programming approaches. Recent work has demonstrated that custom hardware accelerators for neural network processing can outperform software implementations in both performance and power consumption. However, there is neither an agreed-upon interface to neural network accelerators nor a consensus on neural network hardware implementations. We present a generic set of software/hardware extensions, X-FILES, that allow for the general-purpose integration of feedforward and feedback neural network computation in applications. The interface is independent of the network type, configuration, and implementation. Using these proposed extensions, we demonstrate and evaluate an example dynamically allocated, multi-context neural network accelerator architecture, DANA. We show that the combination of X-FILES and our hardware prototype, DANA, enables generic support and increased throughput for neural-network-based computation in multi-threaded scenarios.

Citations: 22

Runtime Value Numbering: A Profiling Technique to Pinpoint Redundant Computations

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.29

Shasha Wen, Xu Liu, Milind Chabbi

Abstract: Redundant computations can severely degrade performance in HPC applications. Redundant computations arise due to various causes such as developers' inattention to performance, inappropriate choice of algorithms, and inefficient code generation, among others. Aliasing, limited optimization scopes, and insensitivity to input and execution contexts act as severe deterrents to static program analysis. Furthermore, static analysis cannot quantify the benefit from redundancy elimination. Consequently, large optimization efforts may yield little or no benefit. To address these limitations, we develop a dynamic profiler to pinpoint and quantify redundant computations in an execution. Our methodology, Runtime Value Numbering (RVN), is based on the classical value numbering technique but works at runtime instead of compile time. RVN works on unmodified, fully-optimized binaries. RVN provides insightful feedback about redundancies and helps developers tune their applications for high performance. Since RVN employs fine-grained instrumentation, it incurs high overhead. We apply several optimizations to reduce the profiling overhead. Guided by the feedback from RVN, we optimize four benchmarks from the SPEC CPU2000/2006 suites, Sweep3D, and NAS Multi Grid (MG). We speed up these programs by up to 1.22x. RVN identifies computation redundancies that compilers failed to optimize even with profile-guided optimization.
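The idea of applying classical value numbering to an executed instruction stream can be sketched on a toy trace. This is an illustrative sketch only, not the RVN profiler: RVN instruments real binaries, whereas the code below assumes a hypothetical trace of `(pc, dest, op, src_a, src_b)` tuples, and the function name `find_redundant` is invented here. It also ignores commutativity and memory aliasing.

```python
from collections import Counter


def find_redundant(trace):
    """Count, per instruction address, how often a computation repeated
    one whose operands carried the same value numbers."""
    value_of = {}   # location (register/variable name) -> value number
    table = {}      # (op, vn_a, vn_b) -> value number of the result
    redundant = Counter()
    next_vn = 0

    def vn(operand):
        # Give a fresh value number to any operand seen for the first time.
        nonlocal next_vn
        if operand not in value_of:
            value_of[operand] = next_vn
            next_vn += 1
        return value_of[operand]

    for pc, dest, op, a, b in trace:
        key = (op, vn(a), vn(b))
        if key in table:
            redundant[pc] += 1      # same operation on the same values
        else:
            table[key] = next_vn
            next_vn += 1
        value_of[dest] = table[key]  # dest now holds that value
    return redundant


# Hypothetical trace: instruction 1 recomputes x + y; instruction 3 does
# not, because x was overwritten by instruction 2 in between.
trace = [
    (0, "t1", "add", "x", "y"),
    (1, "t2", "add", "x", "y"),
    (2, "x", "mul", "t1", "t2"),
    (3, "t3", "add", "x", "y"),
]
report = find_redundant(trace)
```

Keying the table on operand value numbers rather than operand names is what lets the sketch distinguish instruction 3 from instruction 1 even though their source text is identical.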

Citations: 16

OSPREY: Implementation of Memory Consistency Models for Cache Coherence Protocols involving Invalidation-Free Data Access

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.45

George Kurian, Qingchuan Shi, S. Devadas, O. Khan

Abstract: Data access in modern processors contributes significantly to the overall performance and energy consumption. Traditionally, data is distributed among the cores through an on-chip cache hierarchy, and each producer/consumer accesses data through its private level-1 cache, relying on the cache coherence protocol for consistency. Recently, remote access, a mechanism that reduces energy and latency through word-level access to data anywhere on chip, has been proposed. Remote access does not replicate data in the private caches, and thereby removes the need for expensive cache line invalidations or updates. Researchers have implemented remote access as an auxiliary mechanism in cache coherence to improve efficiency. Unfortunately, stronger memory models, such as Intel's TSO, require strict ordering among the loads and stores. This introduces serialization penalties for data classified to be accessed remotely, which hampers each core's ability to optimally exploit memory-level parallelism. In this paper we propose a novel timestamp-based scheme to detect memory consistency violations. The proposed scheme enables remote accesses to be issued and completed in parallel while continuously detecting whether any ordering violations have occurred, and rolling back the pipeline state (if needed). We implement our scheme for the locality-aware cache coherence protocol that uses remote access as an auxiliary mechanism for efficient data access. Our evaluation using a 64-core multicore processor with out-of-order speculative cores shows that the proposed technique improves completion time by 26% and energy by 20% over a state-of-the-art cache management scheme.

Citations: 4

DVFS-Aware Consolidation for Energy-Efficient Clouds

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.59

Patricia Arroba, Jose M. Moya, J. Ayala, R. Buyya

Citations: 39

Cosmology and Computers: HACCing the Universe

2015 International Conference on Parallel Architecture and Compilation (PACT) Pub Date : 2015-10-18 DOI: 10.1109/PACT.2015.50

S. Habib

Citations: 3