IEEE International Symposium on High-Performance Computer Architecture: Latest Publications

Flexible register management using reference counting
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169033
Steven J. Battle, Andrew D. Hilton, Mark Hempstead, A. Roth
Abstract: Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme: reference counting. We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders. We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency-tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables.
Citations: 12
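A minimal Python sketch of the bit-matrix scheme the abstract describes: one row per holder (e.g., an in-flight instruction), one column per physical register, with the free list formed by NOR-ing columns and allocation done priority-encoder style. Class and method names are illustrative, not taken from the paper.

```python
class RefCountMatrix:
    """Bit-matrix reference counting: rows = holders, columns = physical registers."""

    def __init__(self, num_regs, num_holders):
        self.num_regs = num_regs
        # One bit per (holder, register) pair; set while the holder references it.
        self.rows = [[0] * num_regs for _ in range(num_holders)]

    def acquire(self, holder, reg):
        self.rows[holder][reg] = 1      # holder gains a reference to reg

    def release(self, holder, reg):
        self.rows[holder][reg] = 0      # reference dropped; reg may become free

    def free_vector(self):
        # Hardware NORs each column: a register is free iff no row references it.
        return [int(not any(row[r] for row in self.rows))
                for r in range(self.num_regs)]

    def allocate(self):
        # Priority encoder: return the lowest-numbered free register, if any.
        for r, free in enumerate(self.free_vector()):
            if free:
                return r
        return None
```

Because freeness is recomputed from the matrix rather than tracked by a circular queue, registers can be acquired and released in any order, which is what makes techniques like move elimination and checkpointing straightforward to layer on top.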
π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168951
A. Negi, J. Gil, M. Acacio, José M. García, P. Stenström
Abstract: Lazy hardware transactional memory has been shown to be more efficient at extracting available concurrency than its eager counterpart. However, it poses scalability challenges at commit time because the existence of conflicts among concurrent transactions is not known prior to commit. Non-conflicting transactions may have to wait before committing, severely affecting performance in certain workloads. Early conflict detection can be employed to allow such transactions to commit simultaneously. In this paper we show that the potential of this technique has not yet been fully utilized, with design choices in prior work severely burdening common-case transactional execution to avoid some relatively uncommon correctness concerns. The paper quantifies the severity of the problem and develops π-TM, an early-conflict-detection, lazy-conflict-resolution design. This design highlights how, with modest extensions to existing directory-based coherence protocols, information regarding possible conflicts can be effectively used to achieve true parallelism at commit without burdening the common case. We leverage the observation that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions. Pessimistic invalidation of such lines when committing or aborting therefore enables fast common-case execution. Our results show that π-TM performs consistently well and, in particular, far better than previous work on early conflict detection in lazy HTM. We also identify a pathological scenario that lazy designs with early conflict detection suffer from and propose a simple hardware workaround to sidestep it.
Citations: 20
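To make the early-detection/lazy-resolution split concrete, here is a toy software model, assuming per-transaction read/write sets and a directory-style lookup at access time. This sketches the general idea only; it is not the paper's hardware protocol, and it omits the pessimistic-invalidation mechanism entirely.

```python
class Txn:
    def __init__(self, tid):
        self.tid = tid
        self.read_set, self.write_set = set(), set()
        self.conflicts = set()   # tids this transaction must serialize against

def access(txn, addr, is_write, live_txns):
    # Early detection: like a directory lookup, consult other live
    # transactions at access time and record (but do not yet resolve)
    # any conflict.
    for other in live_txns:
        if other is txn:
            continue
        clash = (addr in other.write_set or
                 (is_write and addr in other.read_set))
        if clash:
            txn.conflicts.add(other.tid)
            other.conflicts.add(txn.tid)
    (txn.write_set if is_write else txn.read_set).add(addr)

def can_commit_now(txn, committing):
    # Lazy resolution: only transactions with recorded conflicts serialize;
    # all others commit in parallel without any commit-time arbitration.
    return not (txn.conflicts & {t.tid for t in committing})
```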
Architectural support for synchronization-free deterministic parallel programming
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169038
Cedomir Segulja, T. Abdelrahman
Abstract: We propose a novel synchronization mechanism called versioning. It dynamically establishes a deterministic order of memory accesses in parallel programs that have serial semantics, in a way that is transparent to the programmer. This order is created in a distributed manner and is enforced by monitoring memory accesses and stalling threads if necessary. Versioning gives rise to parallel programming models in which programmers need not explicitly synchronize threads and only need to specify shared data, which greatly simplifies parallel programming. However, versioning introduces overheads and thus demands architectural support. We describe versioning and the architectural support it needs. We also propose a parallel programming model that utilizes versioning and use it to parallelize 13 benchmark applications. We build an FPGA prototype of a multiprocessor system with versioning support and show that good parallel speedups are obtained. Our analysis shows minimal impact of versioning, both in terms of timing overheads and in terms of additional hardware.
Citations: 12
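A software analogue of the stall-until-your-turn behavior, assuming each shared datum carries a version counter and each access knows the version it would observe under serial semantics. The paper implements this in hardware, so everything below is an illustrative approximation with hypothetical names.

```python
import threading

class VersionedCell:
    """A shared datum whose accesses complete in a fixed, deterministic order."""

    def __init__(self, value=0):
        self.value, self.version = value, 0
        self.cv = threading.Condition()

    def access(self, expected_version, fn):
        with self.cv:
            # Stall this thread until the datum reaches the version that the
            # serial-semantics order assigns to this access.
            while self.version != expected_version:
                self.cv.wait()
            result = fn(self)
            self.version += 1      # hand the datum to the next access in order
            self.cv.notify_all()
            return result

cell = VersionedCell()
# thread A: cell.access(0, lambda c: setattr(c, "value", c.value + 1))
# thread B: cell.access(1, lambda c: c.value)   # always sees A's update
```

However the OS schedules the threads, access 0 completes before access 1, so the outcome is reproducible without programmer-inserted locks.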
Cache restoration for highly partitioned virtualized systems
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169029
D. Daly, Harold W. Cain
Abstract: The economics of server consolidation have led to the support of virtualization features in almost all server-class systems, with the related feature set being a subject of significant competition. While most systems allow for partitioning at the relatively coarse grain of a single core, some systems also support multiprogrammed virtualization, whereby a system can be more finely partitioned through time-sharing, down to a percentage of a core being allotted to a virtual machine. When multiple virtual machines share a single core, however, performance can suffer due to the displacement of microarchitectural state. We introduce cache restoration, a hardware-based prefetching mechanism initiated by the underlying virtualization software when a virtual machine is being scheduled on a core, prefetching its working set and warming its initial environment. Through cycle-accurate simulation of a POWER7 system, we show that when applied to its private per-core L3 last-level cache, the warm cache translates into an average performance improvement of 20% for a mixture of workloads on a highly partitioned core, compared to a virtualized server without cache restoration.
Citations: 12
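The mechanism lends itself to a small software model: snapshot a VM's resident lines when it is switched out, then prefetch that footprint when it is switched in. The class below is a toy illustration under those assumptions, not the POWER7 hardware interface.

```python
class CacheRestorer:
    """Toy model of cache restoration for a per-core last-level cache."""

    def __init__(self):
        self.cache = {}        # line address -> owning vm_id (toy L3 model)
        self.footprints = {}   # vm_id -> snapshot of resident line addresses

    def touch(self, vm_id, addr):
        self.cache[addr] = vm_id    # the VM brings a line into the cache

    def on_deschedule(self, vm_id):
        # Snapshot the VM's working set before another VM displaces it.
        self.footprints[vm_id] = {a for a, v in self.cache.items() if v == vm_id}

    def on_schedule(self, vm_id):
        # Warm the cache: re-fetch the saved footprint as the VM resumes,
        # standing in for the hardware prefetch engine.
        for addr in self.footprints.get(vm_id, ()):
            self.touch(vm_id, addr)
```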
SCD: A scalable coherence directory with flexible sharer set encoding
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168950
Daniel Sánchez, C. Kozyrakis
Abstract: Large-scale CMPs with hundreds of cores require a directory-based protocol to maintain cache coherence. However, previously proposed coherence directories are hard to scale beyond tens of cores, requiring either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets that increase coherence traffic and degrade performance. We present SCD, a scalable coherence directory that relies on efficient highly-associative caches (such as zcaches) to implement a single-level directory that scales to thousands of cores, tracks sharer sets exactly, and incurs negligible directory-induced invalidations. SCD scales because, unlike conventional directories, it uses a variable number of directory tags to represent sharer sets: lines with one or few sharers use a single tag, while widely shared lines use additional tags, so tags remain small as the system scales up. We show that, thanks to the efficient highly-associative array it relies on, SCD can be fully characterized using analytical models, and can be sized to guarantee a negligible number of evictions independently of the workload. We evaluate SCD using simulations of a 1024-core CMP. For the same level of coverage, we find that SCD is 13× more area-efficient than full-map sparse directories, and 2× more area-efficient and faster than hierarchical directories, while requiring a simpler protocol. Furthermore, we show that SCD's analytical models are accurate in practice.
Citations: 111
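The variable-tag encoding can be sketched as follows, assuming illustrative parameters (three limited pointers per tag, 64-core bitvector chunks). The real SCD stores these tags in a highly-associative zcache-like array, which this sketch omits.

```python
LIMITED_PTRS = 3   # sharers a single limited-pointer tag can hold (illustrative)
CHUNK = 64         # cores covered by each bitvector-chunk tag (illustrative)

def encode_sharers(sharers):
    """Return the directory tags needed for a sharer set, SCD-style:
    one limited-pointer tag for few sharers, otherwise a root tag plus
    one bitvector-chunk tag per 64-core chunk that contains sharers."""
    if len(sharers) <= LIMITED_PTRS:
        return [("ptr", tuple(sorted(sharers)))]
    chunks = {}
    for core in sharers:
        idx = core // CHUNK
        chunks[idx] = chunks.get(idx, 0) | (1 << (core % CHUNK))
    return ([("root", tuple(sorted(chunks)))] +
            [("chunk", idx, bits) for idx, bits in sorted(chunks.items())])
```

A line with two sharers costs one tag, while a line shared by hundreds of cores costs a root tag plus several chunk tags, so directory storage grows with actual sharing rather than with the worst case.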
Staged Reads: Mitigating the impact of DRAM writes on DRAM reads
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168943
Niladrish Chatterjee, Naveen Muralimanohar, R. Balasubramonian, A. Davis, N. Jouppi
Abstract: Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes. This is because a single off-chip memory bus is shared by reads and writes and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a novel mechanism to boost read-write parallelism and perform useful components of read operations even when the memory system is busy performing writes. If some of the banks are busy servicing writes, we start issuing reads to the other idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads. These results are then returned immediately after the bus turnaround. The process is referred to as a Staged Read because it decouples a single read operation into two stages, with the first step being performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip. The proposed technique works best when there is bank imbalance in the write stream. We also introduce a write scheduling algorithm that artificially creates bank imbalance and allows useful read operations to be performed during the write drain. Across a suite of memory-intensive workloads, we show that Staged Reads can boost throughput by up to 33% (average 7%) with an average DRAM access latency improvement of 17%, while incurring a very small cost (0.25%) in terms of memory chip area. The throughput improvements are even greater when considering write-intensive workloads (average 11%) or future systems (average 12%).
Citations: 61
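A toy scheduling routine showing the two-stage decoupling, under the assumption that reads to banks not occupied by the write drain can complete their array access into staging registers early; the names and data structures are illustrative, not the paper's.

```python
def schedule_with_staged_reads(write_busy_banks, reads_by_bank):
    """write_busy_banks: set of bank ids occupied by the write drain.
    reads_by_bank: dict mapping bank id -> pending read requests."""
    staged, deferred = [], {}
    for bank, reads in reads_by_bank.items():
        if bank in write_busy_banks:
            deferred[bank] = reads    # bank busy: read must wait for the drain
        else:
            # Stage 1: perform the array access now, latching results into
            # staging registers near the I/O pads (no bus needed yet).
            staged.extend(reads)
    # ... the write drain completes and the bus direction turns around ...
    # Stage 2: staged data bursts out immediately, ahead of deferred reads.
    return staged, deferred
```

The benefit comes entirely from bank imbalance: the more idle banks during a write drain, the more reads finish stage 1 for free, which is why the paper's write scheduler deliberately creates such imbalance.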
Computational sprinting
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169031
Arun Raghavan, Yixin Luo, Anuj Chandawalla, M. Papaefthymiou, K. Pipe, T. Wenisch, Milo M. K. Martin
Abstract: Although transistor density continues to increase, voltage scaling has stalled and thus power density is increasing each technology generation. Particularly in mobile devices, which have limited cooling options, these trends lead to a utilization wall in which sustained chip performance is limited primarily by power rather than area. However, many mobile applications do not demand sustained performance; rather they comprise short bursts of computation in response to sporadic user activity. To improve responsiveness for such applications, this paper explores activating otherwise powered-down cores for sub-second bursts of intense parallel computation. The approach exploits the concept of computational sprinting, in which a chip temporarily exceeds its sustainable thermal power budget to provide instantaneous throughput, after which the chip must return to nominal operation to cool down. To demonstrate the feasibility of this approach, we analyze the thermal and electrical characteristics of a smart-phone-like system that nominally operates a single core (~1W peak), but can sprint with up to 16 cores for hundreds of milliseconds. We describe a thermal design that incorporates phase-change materials to provide thermal capacitance to enable such sprints. We analyze image recognition kernels to show that parallel sprinting has the potential to achieve the task response time of a 16W chip within the thermal constraints of a 1W mobile platform.
Citations: 181
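The thermal-capacitance argument reduces to simple arithmetic: the latent heat of the phase-change material bounds how long the chip can exceed its sustainable power. The numbers below are illustrative, not the paper's measurements.

```python
def sprint_duration_s(sprint_power_w, sustained_power_w,
                      pcm_mass_g, latent_heat_j_per_g):
    # Heat absorbed while the PCM melts buys time above the sustainable budget.
    excess_power = sprint_power_w - sustained_power_w
    thermal_budget_j = pcm_mass_g * latent_heat_j_per_g
    return thermal_budget_j / excess_power

# Example: 0.05 g of PCM at ~200 J/g, sprinting at 16 W on a 1 W platform:
print(sprint_duration_s(16.0, 1.0, 0.05, 200.0))   # 10 J / 15 W ≈ 0.67 s
```

A fraction of a gram of material already yields the sub-second sprint window the abstract targets, after which the chip must fall back to nominal operation while the PCM re-solidifies.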
Design, integration and implementation of the DySER hardware accelerator into OpenSPARC
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168949
Jesse Benson, Ryan Cofell, Chris Frericks, C. Ho, Venkatraman Govindaraju, Tony Nowatzki, K. Sankaralingam
Abstract: Accelerators and specialization in various forms are emerging as a way to increase processor performance. Examples include Navigo, Conservation-Cores, BERET, and DySER. While each of these employs different primitives and principles to achieve specialization, they share some common concerns with regard to implementation. Two of these concerns are: how to integrate them with a commercial processor and how to develop their compiler toolchain. This paper undertakes an implementation study of one design point: integration of DySER into OpenSPARC, a design we call OpenSPlySER. We report on our implementation exercise and quantitative results, and conclude with a set of lessons learned. We demonstrate that DySER delivers on its goal of providing a non-intrusive accelerator design. OpenSPlySER runs on a Virtex-5 FPGA, boots unmodified Linux, and runs most of the SPECINT benchmarks with our compiler. Due to physical design constraints, speedups on full benchmarks are modest for the FPGA prototype. On targeted microbenchmarks, OpenSPlySER delivers up to a 31-fold speedup over the baseline OpenSPARC. We conclude with some lessons learned from this somewhat unique exercise of significantly modifying a commercial processor. To the best of our knowledge, this work is one of the most ambitious extensions of OpenSPARC.
Citations: 51
Supporting efficient collective communication in NoCs
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6168953
Sheng Ma, Natalie D. Enright Jerger, Zhiying Wang
Abstract: Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction, or many-to-one, communication operations. As a case study, we focus on acknowledgement messages (ACKs) that must be collected in a directory protocol before a cache line may be upgraded to or installed in the modified state. This paper makes two primary contributions: an efficient framework to support the reduction of ACK packets and a novel Balanced, Adaptive Multicast (BAM) routing algorithm. The proposed message combination framework complements several multicast algorithms. By combining ACK packets during transmission, this framework not only reduces packet latency by 14.1% for low-to-medium network loads, but also improves the network saturation throughput by 9.6% with little overhead. The balanced buffer resource configuration of BAM improves the saturation throughput by an additional 13.8%. For the PARSEC benchmarks, our design offers an average speedup of 12.7% and a maximal speedup of 16.8%.
Citations: 35
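For the ACK-reduction case study, the core idea is that a router can merge ACKs belonging to the same coherence transaction into a single packet carrying a count, which the directory then compares against the expected number of sharers. Below is a toy model with hypothetical names, not the paper's router microarchitecture.

```python
class CombiningRouter:
    """Merges ACK packets for the same coherence transaction (toy model)."""

    def __init__(self):
        self.pending = {}    # txn_id -> ACKs accumulated at this router

    def receive_ack(self, txn_id, count=1):
        # Instead of forwarding each ACK separately, fold it into a running
        # count; an incoming packet may itself already carry a combined count.
        self.pending[txn_id] = self.pending.get(txn_id, 0) + count

    def forward(self, txn_id):
        # One combined packet leaves the router in place of many ACKs.
        return ("ACK", txn_id, self.pending.pop(txn_id, 0))
```

Combining shrinks both the packet count and the hot-spot pressure at the directory node, which is where the latency and saturation-throughput gains reported above come from.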
Decoupled dynamic cache segmentation
IEEE International Symposium on High-Performance Computer Architecture Pub Date: 2012-02-25 DOI: 10.1109/HPCA.2012.6169030
S. Khan, Zhe Wang, Daniel A. Jiménez
Abstract: The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because the temporal locality of memory accesses is filtered by the first- and second-level caches. We propose a cache segmentation technique that dynamically adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. This technique is independent of the LRU policy, so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. On a single-core processor with a 2MB LLC, running a memory-intensive subset of the SPEC CPU 2006 benchmarks, it outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and on average by 2.2% with random replacement. The technique also complements existing shared cache partitioning techniques. Our evaluation with 10 multi-programmed workloads shows that this technique improves the performance of a four-core system with an 8MB LLC on average by 12%, with a random replacement policy requiring only half the space of the LRU policy.
Citations: 28
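The segmentation policy can be sketched as a victim-selection rule, assuming a predictor (not shown) that supplies the best number of not-yet-referenced blocks to keep per set; a target of zero corresponds to bypassing incoming blocks. The code below is illustrative, not the paper's hardware.

```python
import random
from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    referenced: bool = False    # set when the block is hit after insertion

def choose_victim(set_blocks, target_unref):
    """Evict from whichever segment exceeds its predicted share."""
    unref = [b for b in set_blocks if not b.referenced]
    ref = [b for b in set_blocks if b.referenced]
    if not ref or len(unref) > target_unref:
        # Too many never-referenced blocks (or nothing else to evict).
        return random.choice(unref)
    return random.choice(ref)
```

Because the rule only needs segment membership, not a recency ordering, it pairs with cheap in-segment policies like NRU or random replacement, which is exactly the decoupling from LRU that the abstract emphasizes.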