IEEE International Symposium on High-Performance Computer Architecture: Latest Publications

Flexible register management using reference counting
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169033
Steven J. Battle, Andrew D. Hilton, Mark Hempstead, A. Roth
Abstract: Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme: reference counting. We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders. We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency-tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables.
Citations: 12
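A minimal Python sketch of the bit-matrix scheme the abstract describes: one row per holder (e.g., an in-flight instruction), one column per physical register, with the free list formed by NOR-ing columns and allocation done priority-encoder style. Class and method names are illustrative, not taken from the paper.

```python
class RefCountMatrix:
    """Bit-matrix reference counting: rows = holders, columns = physical registers."""

    def __init__(self, num_regs, num_holders):
        self.num_regs = num_regs
        # One bit per (holder, register) pair; set while the holder references it.
        self.rows = [[0] * num_regs for _ in range(num_holders)]

    def acquire(self, holder, reg):
        self.rows[holder][reg] = 1      # holder gains a reference to reg

    def release(self, holder, reg):
        self.rows[holder][reg] = 0      # reference dropped; reg may become free

    def free_vector(self):
        # Hardware NORs each column: a register is free iff no row references it.
        return [int(not any(row[r] for row in self.rows))
                for r in range(self.num_regs)]

    def allocate(self):
        # Priority encoder: return the lowest-numbered free register, if any.
        for r, free in enumerate(self.free_vector()):
            if free:
                return r
        return None
```

Because freeness is recomputed from the matrix rather than tracked by a circular queue, registers can be acquired and released in any order, which is what makes techniques like move elimination and checkpointing straightforward to layer on top.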
π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168951
A. Negi, J. Gil, M. Acacio, José M. García, P. Stenström
Abstract: Lazy hardware transactional memory has been shown to be more efficient at extracting available concurrency than its eager counterpart. However, it poses scalability challenges at commit time because the existence of conflicts among concurrent transactions is not known prior to commit. Non-conflicting transactions may have to wait before committing, severely affecting performance in certain workloads. Early conflict detection can be employed to allow such transactions to commit simultaneously. In this paper we show that the potential of this technique has not yet been fully utilized, with design choices in prior work severely burdening common-case transactional execution to avoid some relatively uncommon correctness concerns. The paper quantifies the severity of the problem and develops π-TM, an early-conflict-detection, lazy-conflict-resolution design. This design highlights how, with modest extensions to existing directory-based coherence protocols, information regarding possible conflicts can be effectively used to achieve true parallelism at commit without burdening the common case. We leverage the observation that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions. Pessimistic invalidation of such lines when committing or aborting therefore enables fast common-case execution. Our results show that π-TM performs consistently well and, in particular, far better than previous work on early conflict detection in lazy HTM. We also identify a pathological scenario that lazy designs with early conflict detection suffer from and propose a simple hardware workaround to sidestep it.
Citations: 20
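To make the early-detection/lazy-resolution split concrete, here is a toy software model, assuming per-transaction read/write sets and a directory-style lookup at access time. This sketches the general idea only; it is not the paper's hardware protocol, and it omits the pessimistic-invalidation mechanism entirely.

```python
class Txn:
    def __init__(self, tid):
        self.tid = tid
        self.read_set, self.write_set = set(), set()
        self.conflicts = set()   # tids this transaction must serialize against

def access(txn, addr, is_write, live_txns):
    # Early detection: like a directory lookup, consult other live
    # transactions at access time and record (but do not yet resolve)
    # any conflict.
    for other in live_txns:
        if other is txn:
            continue
        clash = (addr in other.write_set or
                 (is_write and addr in other.read_set))
        if clash:
            txn.conflicts.add(other.tid)
            other.conflicts.add(txn.tid)
    (txn.write_set if is_write else txn.read_set).add(addr)

def can_commit_now(txn, committing):
    # Lazy resolution: only transactions with recorded conflicts serialize;
    # all others commit in parallel without any commit-time arbitration.
    return not (txn.conflicts & {t.tid for t in committing})
```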
Architectural support for synchronization-free deterministic parallel programming
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169038
Cedomir Segulja, T. Abdelrahman
Abstract: We propose a novel synchronization mechanism called versioning. It dynamically establishes a deterministic order of memory accesses in parallel programs that have serial semantics, in a way that is transparent to the programmer. This order is created in a distributed manner and is enforced by monitoring memory accesses and stalling threads if necessary. Versioning gives rise to parallel programming models in which programmers need not explicitly synchronize threads and only need to specify shared data, which greatly simplifies parallel programming. However, versioning introduces overheads and thus demands architectural support. We describe versioning and the architectural support it needs. We also propose a parallel programming model that utilizes versioning and use it to parallelize 13 benchmark applications. We build an FPGA prototype of a multiprocessor system with versioning support and show that good parallel speedups are obtained. Our analysis shows minimal impact of versioning, both in terms of timing overheads and in terms of additional hardware.
Citations: 12
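A software analogue of the stall-until-your-turn behavior, assuming each shared datum carries a version counter and each access knows the version it would observe under serial semantics. The paper implements this in hardware, so everything below is an illustrative approximation with hypothetical names.

```python
import threading

class VersionedCell:
    """A shared datum whose accesses complete in a fixed, deterministic order."""

    def __init__(self, value=0):
        self.value, self.version = value, 0
        self.cv = threading.Condition()

    def access(self, expected_version, fn):
        with self.cv:
            # Stall this thread until the datum reaches the version that the
            # serial-semantics order assigns to this access.
            while self.version != expected_version:
                self.cv.wait()
            result = fn(self)
            self.version += 1      # hand the datum to the next access in order
            self.cv.notify_all()
            return result

cell = VersionedCell()
# thread A: cell.access(0, lambda c: setattr(c, "value", c.value + 1))
# thread B: cell.access(1, lambda c: c.value)   # always sees A's update
```

However the OS schedules the threads, access 0 completes before access 1, so the outcome is reproducible without programmer-inserted locks.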
Cache restoration for highly partitioned virtualized systems
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169029
D. Daly, Harold W. Cain
Abstract: The economics of server consolidation have led to the support of virtualization features in almost all server-class systems, with the related feature set being a subject of significant competition. While most systems allow for partitioning at the relatively coarse grain of a single core, some systems also support multiprogrammed virtualization, whereby a system can be more finely partitioned through time-sharing, down to a percentage of a core being allotted to a virtual machine. When multiple virtual machines share a single core, however, performance can suffer due to the displacement of microarchitectural state. We introduce cache restoration, a hardware-based prefetching mechanism initiated by the underlying virtualization software when a virtual machine is being scheduled on a core, prefetching its working set and warming its initial environment. Through cycle-accurate simulation of a POWER7 system, we show that when applied to its private per-core L3 last-level cache, the warm cache translates into an average performance improvement of 20% for a mixture of workloads on a highly partitioned core, compared to a virtualized server without cache restoration.
Citations: 12
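The mechanism lends itself to a small software model: snapshot a VM's resident lines when it is switched out, then prefetch that footprint when it is switched in. The class below is a toy illustration under those assumptions, not the POWER7 hardware interface.

```python
class CacheRestorer:
    """Toy model of cache restoration for a per-core last-level cache."""

    def __init__(self):
        self.cache = {}        # line address -> owning vm_id (toy L3 model)
        self.footprints = {}   # vm_id -> snapshot of resident line addresses

    def touch(self, vm_id, addr):
        self.cache[addr] = vm_id    # the VM brings a line into the cache

    def on_deschedule(self, vm_id):
        # Snapshot the VM's working set before another VM displaces it.
        self.footprints[vm_id] = {a for a, v in self.cache.items() if v == vm_id}

    def on_schedule(self, vm_id):
        # Warm the cache: re-fetch the saved footprint as the VM resumes,
        # standing in for the hardware prefetch engine.
        for addr in self.footprints.get(vm_id, ()):
            self.touch(vm_id, addr)
```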
SCD: A scalable coherence directory with flexible sharer set encoding
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168950
Daniel Sánchez, C. Kozyrakis
Abstract: Large-scale CMPs with hundreds of cores require a directory-based protocol to maintain cache coherence. However, previously proposed coherence directories are hard to scale beyond tens of cores, requiring either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets that increase coherence traffic and degrade performance. We present SCD, a scalable coherence directory that relies on efficient highly-associative caches (such as zcaches) to implement a single-level directory that scales to thousands of cores, tracks sharer sets exactly, and incurs negligible directory-induced invalidations. SCD scales because, unlike conventional directories, it uses a variable number of directory tags to represent sharer sets: lines with one or few sharers use a single tag, while widely shared lines use additional tags, so tags remain small as the system scales up. We show that, thanks to the efficient highly-associative array it relies on, SCD can be fully characterized using analytical models, and can be sized to guarantee a negligible number of evictions independently of the workload. We evaluate SCD using simulations of a 1024-core CMP. For the same level of coverage, we find that SCD is 13× more area-efficient than full-map sparse directories, and 2× more area-efficient and faster than hierarchical directories, while requiring a simpler protocol. Furthermore, we show that SCD's analytical models are accurate in practice.
Citations: 111
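The variable-tag encoding can be sketched as follows, assuming illustrative parameters (three limited pointers per tag, 64-core bitvector chunks). The real SCD stores these tags in a highly-associative zcache-like array, which this sketch omits.

```python
LIMITED_PTRS = 3   # sharers a single limited-pointer tag can hold (illustrative)
CHUNK = 64         # cores covered by each bitvector-chunk tag (illustrative)

def encode_sharers(sharers):
    """Return the directory tags needed for a sharer set, SCD-style:
    one limited-pointer tag for few sharers, otherwise a root tag plus
    one bitvector-chunk tag per 64-core chunk that contains sharers."""
    if len(sharers) <= LIMITED_PTRS:
        return [("ptr", tuple(sorted(sharers)))]
    chunks = {}
    for core in sharers:
        idx = core // CHUNK
        chunks[idx] = chunks.get(idx, 0) | (1 << (core % CHUNK))
    return ([("root", tuple(sorted(chunks)))] +
            [("chunk", idx, bits) for idx, bits in sorted(chunks.items())])
```

A line with two sharers costs one tag, while a line shared by hundreds of cores costs a root tag plus several chunk tags, so directory storage grows with actual sharing rather than with the worst case.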
Staged Reads: Mitigating the impact of DRAM writes on DRAM reads
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168943
Niladrish Chatterjee, Naveen Muralimanohar, R. Balasubramonian, A. Davis, N. Jouppi
Abstract: Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes. This is because a single off-chip memory bus is shared by reads and writes and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a novel mechanism to boost read-write parallelism and perform useful components of read operations even when the memory system is busy performing writes. If some of the banks are busy servicing writes, we start issuing reads to the other idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads. These results are then returned immediately after the bus turnaround. The process is referred to as a Staged Read because it decouples a single read operation into two stages, with the first step being performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip. The proposed technique works best when there is bank imbalance in the write stream. We also introduce a write scheduling algorithm that artificially creates bank imbalance and allows useful read operations to be performed during the write drain. Across a suite of memory-intensive workloads, we show that Staged Reads can boost throughput by up to 33% (average 7%) with an average DRAM access latency improvement of 17%, while incurring a very small cost (0.25%) in terms of memory chip area. The throughput improvements are even greater when considering write-intensive workloads (average 11%) or future systems (average 12%).
Citations: 61
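A toy scheduling routine showing the two-stage decoupling, under the assumption that reads to banks not occupied by the write drain can complete their array access into staging registers early; the names and data structures are illustrative, not the paper's.

```python
def schedule_with_staged_reads(write_busy_banks, reads_by_bank):
    """write_busy_banks: set of bank ids occupied by the write drain.
    reads_by_bank: dict mapping bank id -> pending read requests."""
    staged, deferred = [], {}
    for bank, reads in reads_by_bank.items():
        if bank in write_busy_banks:
            deferred[bank] = reads    # bank busy: read must wait for the drain
        else:
            # Stage 1: perform the array access now, latching results into
            # staging registers near the I/O pads (no bus needed yet).
            staged.extend(reads)
    # ... the write drain completes and the bus direction turns around ...
    # Stage 2: staged data bursts out immediately, ahead of deferred reads.
    return staged, deferred
```

The benefit comes entirely from bank imbalance: the more idle banks during a write drain, the more reads finish stage 1 for free, which is why the paper's write scheduler deliberately creates such imbalance.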
Computational sprinting
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169031
Arun Raghavan, Yixin Luo, Anuj Chandawalla, M. Papaefthymiou, K. Pipe, T. Wenisch, Milo M. K. Martin
Abstract: Although transistor density continues to increase, voltage scaling has stalled and thus power density is increasing each technology generation. Particularly in mobile devices, which have limited cooling options, these trends lead to a utilization wall in which sustained chip performance is limited primarily by power rather than area. However, many mobile applications do not demand sustained performance; rather they comprise short bursts of computation in response to sporadic user activity. To improve responsiveness for such applications, this paper explores activating otherwise powered-down cores for sub-second bursts of intense parallel computation. The approach exploits the concept of computational sprinting, in which a chip temporarily exceeds its sustainable thermal power budget to provide instantaneous throughput, after which the chip must return to nominal operation to cool down. To demonstrate the feasibility of this approach, we analyze the thermal and electrical characteristics of a smart-phone-like system that nominally operates a single core (~1W peak), but can sprint with up to 16 cores for hundreds of milliseconds. We describe a thermal design that incorporates phase-change materials to provide thermal capacitance to enable such sprints. We analyze image recognition kernels to show that parallel sprinting has the potential to achieve the task response time of a 16W chip within the thermal constraints of a 1W mobile platform.
Citations: 181
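The thermal-capacitance argument reduces to simple arithmetic: the latent heat of the phase-change material bounds how long the chip can exceed its sustainable power. The numbers below are illustrative, not the paper's measurements.

```python
def sprint_duration_s(sprint_power_w, sustained_power_w,
                      pcm_mass_g, latent_heat_j_per_g):
    # Heat absorbed while the PCM melts buys time above the sustainable budget.
    excess_power = sprint_power_w - sustained_power_w
    thermal_budget_j = pcm_mass_g * latent_heat_j_per_g
    return thermal_budget_j / excess_power

# Example: 0.05 g of PCM at ~200 J/g, sprinting at 16 W on a 1 W platform:
print(sprint_duration_s(16.0, 1.0, 0.05, 200.0))   # 10 J / 15 W ≈ 0.67 s
```

A fraction of a gram of material already yields the sub-second sprint window the abstract targets, after which the chip must fall back to nominal operation while the PCM re-solidifies.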
Design, integration and implementation of the DySER hardware accelerator into OpenSPARC
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168949
Jesse Benson, Ryan Cofell, Chris Frericks, C. Ho, Venkatraman Govindaraju, Tony Nowatzki, K. Sankaralingam
Abstract: Accelerators and specialization in various forms are emerging as a way to increase processor performance. Examples include Navigo, Conservation-Cores, BERET, and DySER. While each of these employs different primitives and principles to achieve specialization, they share some common concerns with regard to implementation. Two of these concerns are: how to integrate them with a commercial processor and how to develop their compiler toolchain. This paper undertakes an implementation study of one design point: integration of DySER into OpenSPARC, a design we call OpenSPlySER. We report on our implementation exercise and quantitative results, and conclude with a set of lessons learned. We demonstrate that DySER delivers on its goal of providing a non-intrusive accelerator design. OpenSPlySER runs on a Virtex-5 FPGA, boots unmodified Linux, and runs most of the SPECINT benchmarks with our compiler. Due to physical design constraints, speedups on full benchmarks are modest for the FPGA prototype. On targeted microbenchmarks, OpenSPlySER delivers up to a 31-fold speedup over the baseline OpenSPARC. We conclude with some lessons learned from this somewhat unique exercise of significantly modifying a commercial processor. To the best of our knowledge, this work is one of the most ambitious extensions of OpenSPARC.
Citations: 51
Supporting efficient collective communication in NoCs
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168953
Sheng Ma, Natalie D. Enright Jerger, Zhiying Wang
Abstract: Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction, or many-to-one, communication operations. As a case study, we focus on acknowledgement messages (ACKs) that must be collected in a directory protocol before a cache line may be upgraded to or installed in the modified state. This paper makes two primary contributions: an efficient framework to support the reduction of ACK packets and a novel Balanced, Adaptive Multicast (BAM) routing algorithm. The proposed message combination framework complements several multicast algorithms. By combining ACK packets during transmission, this framework not only reduces packet latency by 14.1% for low-to-medium network loads, but also improves the network saturation throughput by 9.6% with little overhead. The balanced buffer resource configuration of BAM improves the saturation throughput by an additional 13.8%. For the PARSEC benchmarks, our design offers an average speedup of 12.7% and a maximal speedup of 16.8%.
Citations: 35
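For the ACK-reduction case study, the core idea is that a router can merge ACKs belonging to the same coherence transaction into a single packet carrying a count, which the directory then compares against the expected number of sharers. Below is a toy model with hypothetical names, not the paper's router microarchitecture.

```python
class CombiningRouter:
    """Merges ACK packets for the same coherence transaction (toy model)."""

    def __init__(self):
        self.pending = {}    # txn_id -> ACKs accumulated at this router

    def receive_ack(self, txn_id, count=1):
        # Instead of forwarding each ACK separately, fold it into a running
        # count; an incoming packet may itself already carry a combined count.
        self.pending[txn_id] = self.pending.get(txn_id, 0) + count

    def forward(self, txn_id):
        # One combined packet leaves the router in place of many ACKs.
        return ("ACK", txn_id, self.pending.pop(txn_id, 0))
```

Combining shrinks both the packet count and the hot-spot pressure at the directory node, which is where the latency and saturation-throughput gains reported above come from.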
Decoupled dynamic cache segmentation
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169030
S. Khan, Zhe Wang, Daniel A. Jiménez
Abstract: The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because the temporal locality of memory accesses is filtered by the first- and second-level caches. We propose a cache segmentation technique that dynamically adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. This technique is independent of the LRU policy, so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. On a single-core processor with a 2MB LLC, running a memory-intensive subset of the SPEC CPU 2006 benchmarks, it outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and on average by 2.2% with random replacement. The technique also complements existing shared cache partitioning techniques. Our evaluation with 10 multi-programmed workloads shows that this technique improves the performance of a four-core system with an 8MB LLC on average by 12%, with a random replacement policy requiring only half the space of the LRU policy.
Citations: 28
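The segmentation policy can be sketched as a victim-selection rule, assuming a predictor (not shown) that supplies the best number of not-yet-referenced blocks to keep per set; a target of zero corresponds to bypassing incoming blocks. The code below is illustrative, not the paper's hardware.

```python
import random
from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    referenced: bool = False    # set when the block is hit after insertion

def choose_victim(set_blocks, target_unref):
    """Evict from whichever segment exceeds its predicted share."""
    unref = [b for b in set_blocks if not b.referenced]
    ref = [b for b in set_blocks if b.referenced]
    if not ref or len(unref) > target_unref:
        # Too many never-referenced blocks (or nothing else to evict).
        return random.choice(unref)
    return random.choice(ref)
```

Because the rule only needs segment membership, not a recency ordering, it pairs with cheap in-segment policies like NRU or random replacement, which is exactly the decoupling from LRU that the abstract emphasizes.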