Using a Reconfigurable L1 Data Cache for Efficient Version Management in Hardware Transactional Memory
Adrià Armejach, A. Seyedi, Rubén Titos-Gil, I. Hur, Adrián Cristal, O. Unsal, M. Valero
DOI: 10.1109/PACT.2011.67

Abstract: Transactional Memory (TM) potentially simplifies parallel programming by providing atomicity and isolation for executed transactions. One of the key mechanisms behind these properties is version management, which defines where and how transactional updates (new values) are stored. Version management can be implemented either eagerly or lazily. In Hardware Transactional Memory (HTM) implementations, eager version management puts new values in place and keeps old values in a software log, while lazy version management stores new values in hardware buffers and keeps old values in place. Current HTM implementations of both schemes suffer performance penalties because they cannot handle two versions of the same logical data efficiently. In this paper, we introduce a reconfigurable L1 data cache architecture with two execution modes: a 64KB general-purpose mode and a 32KB TM mode that can manage two versions of the same logical data, holding old and new transactional values in the cache simultaneously while executing transactional workloads. We explain in detail the architectural design and internals of this Reconfigurable Data Cache (RDC), as well as the supported operations that efficiently solve existing version management problems. We describe how the RDC can support both eager and lazy HTM systems, and we present two RDC-HTM designs. Our evaluation shows that the Eager-RDC-HTM and Lazy-RDC-HTM systems achieve 1.36x and 1.18x speedup, respectively, over state-of-the-art proposals. We also evaluate the area and energy effects of our proposal and find that the RDC designs are 1.92x and 1.38x more energy-delay efficient than baseline HTM systems, with less than 0.3% area impact on modern processors.

{"title":"A Software-Managed Coherent Memory Architecture for Manycores","authors":"Jungho Park, Choonki Jang, Jaejin Lee","doi":"10.1109/PACT.2011.46","DOIUrl":"https://doi.org/10.1109/PACT.2011.46","url":null,"abstract":"Cache coherent Non-Uniform Memory Access (cc-NUMA) architectures have been widely used for chip multiprocessors (CMPs). However, they require complicated hardware to properly handle the cache coherence problem. Moreover, it generates heavy on-chip network traffic due to the coherence enforcement. In this work, we propose a simple software-managed coherent memory architecture for many cores. Our memory architecture exploits explicitly addressed local stores. Instead of implementing the complicated cache coherence protocol in hardware, coherence and consistency are supported by software, such as a runtime or an operating system. The local stores together with the software leverage conventional caches to make the architecture much simpler and to generate much less network traffic than conventional ccNUMA-based CMPs. Experimental results indicate that our approach is promising.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123368927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Per Watt Benefits of Dynamic Core Morphing in Asymmetric Multicores
Rance Rodrigues, A. Annamalai, I. Koren, S. Kundu, O. Khan
DOI: 10.1109/PACT.2011.18

Abstract: The trend toward multicore processors is moving the emphasis in computation from sequential to parallel processing. However, not all applications can be parallelized to benefit from multiple cores. Such applications under-utilize parallel resources, yielding sub-optimal performance/watt; they may, however, benefit from powerful uniprocessors. On the other hand, not all applications can take advantage of more powerful uniprocessors. To address the competing requirements of diverse applications, we propose a heterogeneous multicore architecture with a Dynamic Core Morphing (DCM) capability. Depending on the computational demands of the currently executing applications, the resources of a few tightly coupled cores are morphed at runtime. We present a simple hardware-based algorithm that monitors the time-varying computational needs of the application and, when deemed beneficial, triggers reconfiguration of the cores at fine-grain time scales to maximize the performance/watt of the application. The proposed dynamic scheme is then compared against a baseline static heterogeneous multicore configuration and an equivalent homogeneous configuration. Our results show that dynamic morphing of cores can provide performance/watt gains of 43% and 16% on average, compared to the homogeneous and baseline heterogeneous configurations, respectively.

MCFQ: Leveraging Memory-level Parallelism and Application's Cache Friendliness for Efficient Management of Quasi-partitioned Last-level Caches
Dimitris Kaseridis, M. Iqbal, Jeffrey Stuecheli, L. John
DOI: 10.1109/PACT.2011.74

Abstract: To achieve high efficiency and prevent destructive interference among multiple divergent workloads, the last-level cache of a chip multiprocessor has to be carefully managed. Previously proposed cache management schemes utilize cache capacity inefficiently, either focusing solely on reducing the absolute number of cache misses or allocating cache capacity without taking the applications' memory sharing characteristics into consideration. In this work we propose MCFQ, a quasi-partitioning scheme for last-level caches that combines the memory-level parallelism, cache friendliness, and interference sensitivity of competing applications to efficiently manage the shared cache capacity. The proposed scheme improves both system throughput and execution fairness, outperforming previous schemes that are oblivious to applications' memory behavior. Our detailed full-system simulations show an average improvement of 10% in throughput and 9% in fairness over the next-best scheme for a 4-core CMP system.

{"title":"Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control","authors":"Bo Wu, E. Zhang, Xipeng Shen","doi":"10.1109/PACT.2011.56","DOIUrl":"https://doi.org/10.1109/PACT.2011.56","url":null,"abstract":"Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state to improve the data locality for the next period of execution. In this work, we examine the implications that modern heterogeneous Chip Multiprocessors (CMP) architecture imposes on the optimization paradigm. We develop three techniques to enhance the optimizations. The first is asynchronous data transformation, which moves data reordering off the critical path through dependence circumvention. The second is a novel data transformation algorithm, named TLayout, designed specially to take advantage of modern throughput-oriented processors. Together they provide two complementary ways to attack a benefit-overhead dilemma inherited in traditional techniques. Working with a dynamic adaptation scheme, the techniques produce significant performance improvement for a set of dynamic simulation benchmarks.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115700881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Task Order Information for Optimizing Sequentially Consistent Java Programs","authors":"C. Angerer, T. Gross","doi":"10.1109/PACT.2011.70","DOIUrl":"https://doi.org/10.1109/PACT.2011.70","url":null,"abstract":"Java was designed as a secure language that supports running untrusted code as part of trusted applications. For safety reasons, Java therefore defines a memory model that prevents undefined behavior in multi-threaded programs even if the programs are not correctly synchronized. Because of the potential negative performance impact the Java designers did not choose a simple and natural memory model, such as sequential consistency, but instead developed a relaxed memory model that gives the compiler more optimization opportunities. As it is today, however, the relaxed Java Memory Model is not only hard to understand but it unnecessarily complicates reasoning about parallel programs and it turned out to be difficult to implement correctly. This paper presents an optimizing compiler for a Java version that has sequential consistency as its memory model. Based on a programming model with explicit happens-before constraints between tasks, we describe a static schedule analysis that computes whether two tasks may be executed in parallel or if they are ordered. During optimization, the task-ordering information is exploited to reduce the number of volatile memory accesses the compiler must insert to guarantee sequential consistency. The evaluation shows that scheduling information significantly improves the effectiveness of the optimizations. For our set of multi-threaded benchmarks the fully optimizing compiler removes between 70% and 100% of the volatile memory accesses inserted by the non-optimizing compiler. As a result, the overhead of sequentially consistent Java compared to standard Java is reduced from 136% on average for the unoptimized version to 11% on average for the optimized version. The results indicate that with appropriate optimizations, sequential consistency can be a feasible alternative to the Java Memory Model.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125020476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores","authors":"Sangmin Seo, Junghyun Kim, Jaejin Lee","doi":"10.1109/PACT.2011.57","DOIUrl":"https://doi.org/10.1109/PACT.2011.57","url":null,"abstract":"As parallel programming becomes the mainstream due to multicore processors, dynamic memory allocators used in C and C++ can suppress the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization for common cases. It uses only lock-free synchronization mechanisms for uncommon cases. Each thread owns a private heap and handles memory requests on the heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deal locates it by itself. Synchronization-free means that threads do not communicate with each other at all. On the other hand, if a thread allocates a block and another thread frees it, we use a lock-free stack to atomically add it to the owner thread's heap to avoid the memory blowup problem. Furthermore, our allocator exploits various memory block caching mechanisms to reduce the latency of memory management. Freed blocks or intermediate memory chunks are cached hierarchically in each thread's heap and they are used for future memory allocation. We compare the performance and scalability of our allocator to those of well-known existing multi-threaded memory allocators using eight benchmarks. Experimental results on a 48-core AMD system show that our approach achieves better performance than other allocators for all benchmarks and is highly scalable with a large number of threads.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127895549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Parallel Graph Exploration on Multi-Core CPU and GPU","authors":"Sungpack Hong, Tayo Oguntebi, K. Olukotun","doi":"10.1109/PACT.2011.14","DOIUrl":"https://doi.org/10.1109/PACT.2011.14","url":null,"abstract":"Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multicore execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems, a high-end GPU system performed as well as a quad-socket high-end CPU system.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128018495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architecture to Enable Lifetime Full Chip Testability in Chip Multiprocessors","authors":"Rance Rodrigues, I. Koren, S. Kundu","doi":"10.1109/PACT.2011.52","DOIUrl":"https://doi.org/10.1109/PACT.2011.52","url":null,"abstract":"Technology scaling has led to a tremendous increase in the packing density of transistors. However, these small transistors are susceptible to certain impediments that were not present earlier. Manufacturability suffers due to trailing lithography technology which does not scale well with transistor technology. Increased leakage current has reduced effectiveness of burn-in tests. Infant mortality cannot therefore, be completely kept under check. Even during operation, reliability is affected due to CMOS wear-out mechanisms such as time-dependent dielectric breakdown (TDDB), hot carrier injection (HCI), negative bias temperature instability (NBTI), electro migration (EM), and stress induced voiding (SIV).","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"30 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131826711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPATL: Honey, I Shrunk the Coherence Directory
Hongzhou Zhao, Arrvindh Shriraman, S. Dwarkadas, V. Srinivasan
DOI: 10.1109/PACT.2011.10

Abstract: One of the key scalability challenges of on-chip coherence in a multicore chip is the coherence directory, which provides information on the sharing of cache blocks. Shadow tags that duplicate entire private cache tag arrays are widely used to minimize area overhead, but they require an energy-intensive associative search to obtain the sharing information. Recent research proposed a Tagless directory, which uses Bloom filters to summarize the tags in a cache set. The Tagless directory associates the sharing vector with the Bloom filter buckets to completely eliminate the associative lookup and reduce the directory overhead. However, Tagless still uses a full-map sharing vector to represent the sharing information, so area and energy challenges remain as core counts increase. In this paper, we first show that, due to the regular nature of applications, many Bloom filters essentially replicate the same sharing pattern. We then exploit this commonality and propose SPATL (Sharing-Pattern-based Tagless Directory), which decouples the sharing patterns from the Bloom filters and eliminates the redundant copies of sharing patterns. SPATL works with both inclusive and noninclusive shared caches and provides 34% storage savings over Tagless, the previously most storage-efficient directory, at 16 cores. We study multiple strategies to periodically eliminate the false sharing that comes from combining sharing-pattern compression with Tagless, and we demonstrate that SPATL can achieve the same level of false sharers as Tagless with 5% extra bandwidth. Finally, we demonstrate that SPATL scales even better than an idealized directory and can support 1024-core chips with less than 1% of the private cache space for data-parallel applications.
