2016 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

Cost effective physical register sharing
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446105
Arthur Perais, André Seznec
{"title":"Cost effective physical register sharing","authors":"Arthur Perais, André Seznec","doi":"10.1109/HPCA.2016.7446105","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446105","url":null,"abstract":"Sharing a physical register between several instructions is needed to implement several microarchitectural optimizations. However, register sharing requires modifications to the register reclaiming process: Committing a single instruction does not guarantee that the physical register allocated to the previous mapping of its architectural destination register is free-able anymore. Consequently, a form of register reference counting must be implemented. While such mechanisms (e.g., dependency matrix, per register counters) have been described in the literature, we argue that they either require too much storage, or that they lengthen branch misprediction recovery by requiring sequential rollback. As an alternative, we present the Inflight Shared Register Buffer (ISRB), a new structure for register reference counting. The ISRB has low storage overhead and lends itself to checkpoint-based recovery schemes, therefore allowing fast recovery on pipeline flushes. We illustrate our scheme with Move Elimination (short-circuiting moves) and an implementation of Speculative Memory Bypassing (short-circuiting store-load pairs) that makes use of a TAGE-like predictor to identify memory dependencies. We show that the whole potential of these two mechanisms can be achieved with a small register tracking structure.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125978926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
SizeCap: Efficiently handling power surges in fuel cell powered data centers
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446085
Y. Li, Di Wang, Saugata Ghose, Jie Liu, Sriram Govindan, Sean James, Eric Peterson, J. Siegler, Rachata Ausavarungnirun, O. Mutlu
{"title":"SizeCap: Efficiently handling power surges in fuel cell powered data centers","authors":"Y. Li, Di Wang, Saugata Ghose, Jie Liu, Sriram Govindan, Sean James, Eric Peterson, J. Siegler, Rachata Ausavarungnirun, O. Mutlu","doi":"10.1109/HPCA.2016.7446085","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446085","url":null,"abstract":"Fuel cells are a promising power source for future data centers, offering high energy efficiency, low greenhouse gas emissions, and high reliability. However, due to mechanical limitations related to fuel delivery, fuel cells are slow to adjust to sudden increases in data center power demands, which can result in temporary power shortfalls. To mitigate the impact of power shortfalls, prior work has proposed to either perform power capping by throttling the servers, or to leverage energy storage devices (ESDs) that can temporarily provide enough power to make up for the shortfall while the fuel cells ramp up power generation. Both approaches have disadvantages: power capping conservatively limits server performance and can lead to service level agreement (SLA) violations, while ESD-only solutions must significantly overprovision the energy storage device capacity to tolerate the shortfalls caused by the worst-case (i.e., largest) power surges, which greatly increases the total cost of ownership (TCO). We propose SizeCap, the first ESD sizing framework for fuel cell powered data centers, which coordinates ESD sizing with power capping to enable a cost-effective solution to power shortfalls in data centers. SizeCap sizes the ESD just large enough to cover the majority of power surges, but not the worst-case surges that occur infrequently, to greatly reduce TCO. It then uses the smaller capacity ESD in conjunction with power capping to cover the power shortfalls caused by the worst-case power surges. As part of our new flexible framework, we propose multiple power capping policies with different degrees of awareness of fuel cell and workload behavior, and evaluate their impact on workload performance and ESD size. Using traces from Microsoft's production data center systems, we demonstrate that SizeCap significantly reduces the ESD size (by 85%ofor a workload with infrequent yet large power surges, and by 50% for a workload with frequent power surges) without violating any SLAs.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123798060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
RADAR: Runtime-assisted dead region management for last-level caches
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446101
M. Manivannan, Vassilis D. Papaefstathiou, M. Pericàs, P. Stenström
{"title":"RADAR: Runtime-assisted dead region management for last-level caches","authors":"M. Manivannan, Vassilis D. Papaefstathiou, M. Pericàs, P. Stenström","doi":"10.1109/HPCA.2016.7446101","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446101","url":null,"abstract":"Last-level caches (LLCs) bridge the processor/memory speed gap and reduce energy consumed per access. Unfortunately, LLCs are poorly utilized because of the relatively large occurrence of dead blocks. We propose RADAR, a hybrid static/dynamic dead-block management technique that can accurately predict and evict dead blocks in LLCs. RADAR does dead-block prediction and eviction at the granularity of address regions supported in many of today's task-parallel programming models. The runtime system utilizes static control-flow information about future region accesses in conjunction with past region access patterns to make accurate predictions about dead regions. The runtime system instructs the cache to demote and eventually evict blocks belonging to such dead regions. This paper considers three RADAR schemes to predict dead regions: a scheme that uses control-flow information provided by the programming model (Look-ahead), a history-based scheme (Look-back) and a combined scheme (Look-ahead and Look-back). Our evaluation shows that, on average, all RADAR schemes outperform state-of-the-art hardware dead-block prediction techniques, whereas the combined scheme always performs best.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128271879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446078
Zhenning Wang, Jun Yang, R. Melhem, B. Childers, Youtao Zhang, M. Guo
{"title":"Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing","authors":"Zhenning Wang, Jun Yang, R. Melhem, B. Childers, Youtao Zhang, M. Guo","doi":"10.1109/HPCA.2016.7446078","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446078","url":null,"abstract":"Studies show that non-graphics programs can be less optimized for the GPU hardware, leading to significant resource under-utilization. Sharing the GPU among multiple programs can effectively improve utilization, which is particularly attractive to systems where many applications require access to the GPU (e.g., cloud computing). However, current GPUs lack proper architecture features to support sharing. Initial attempts are preliminary: They either provide only static sharing, which requires recompilation or code transformation, or they do not effectively improve GPU resource utilization. We propose Simultaneous Multikernel (SMK), a fine-grain dynamic sharing mechanism, that fully utilizes resources within a streaming multiprocessor by exploiting heterogeneity of different kernels. We propose several resource allocation strategies to improve system throughput while maintaining fairness. Our evaluation shows that for shared workloads with complementary resource occupancy, SMK improves GPU throughput by 52% over non-shared execution and 17% over a state-of-the-art design.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114134145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 129
Best-offset hardware prefetching
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446087
P. Michaud
{"title":"Best-offset hardware prefetching","authors":"P. Michaud","doi":"10.1109/HPCA.2016.7446087","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446087","url":null,"abstract":"Hardware prefetching is an important feature of modern high-performance processors. When the application working set is too large to fit in on-chip caches, disabling hardware pre-fetchers may result in severe performance reduction. A new prefetcher was recently introduced, the Sandbox prefetcher, that tries to find dynamically the best prefetch offset using the sandbox method. The Sandbox prefetcher uses simple hardware and was shown to be quite effective. However, the sandbox method does not take into account prefetch timeliness. We propose an offset prefetcher with a new method for selecting the prefetch offset that takes into account prefetch timeliness. We show that our Best-Offset prefetcher outperforms the Sandbox prefetcher on the SPEC CPU2006 benchmarks, with equally simple hardware.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130802997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 125
ChargeCache: Reducing DRAM latency by exploiting row access locality
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446096
Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, V. Seshadri, Donghyuk Lee, O. Ergin, O. Mutlu
{"title":"ChargeCache: Reducing DRAM latency by exploiting row access locality","authors":"Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, V. Seshadri, Donghyuk Lee, O. Ergin, O. Mutlu","doi":"10.1109/HPCA.2016.7446096","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446096","url":null,"abstract":"DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121567746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 138
MaPU: A novel mathematical computing architecture
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446086
Donglin Wang, Xueliang Du, Leizu Yin, Chen Lin, Hong Ma, Weili Ren, Huijuan Wang, Xingang Wang, Shaolin Xie, L. Wang, Zijun Liu, Tao Wang, Zhonghua Pu, Guangxin Ding, Mengchen Zhu, Lipeng Yang, Ruoshan Guo, Zhiwei Zhang, Xiao Lin, Jie Hao, Yongyong Yang, Wenqin Sun, Fabiao Zhou, NuoZhou Xiao, Q. Cui, Xiaoqin Wang
{"title":"MaPU: A novel mathematical computing architecture","authors":"Donglin Wang, Xueliang Du, Leizu Yin, Chen Lin, Hong Ma, Weili Ren, Huijuan Wang, Xingang Wang, Shaolin Xie, L. Wang, Zijun Liu, Tao Wang, Zhonghua Pu, Guangxin Ding, Mengchen Zhu, Lipeng Yang, Ruoshan Guo, Zhiwei Zhang, Xiao Lin, Jie Hao, Yongyong Yang, Wenqin Sun, Fabiao Zhou, NuoZhou Xiao, Q. Cui, Xiaoqin Wang","doi":"10.1109/HPCA.2016.7446086","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446086","url":null,"abstract":"As the feature size of the semiconductor process is scaling down to 10nm and below, it is possible to assemble systems with high performance processors that can theoretically provide computational power of up to tens of PLOPS. However, the power consumption of these systems is also rocketing up to tens of millions watts, and the actual performance is only around 60% of the theoretical performance. Today, power efficiency and sustained performance have become the main foci of processor designers. Traditional computing architecture such as superscalar and GPGPU are proven to be power inefficient, and there is a big gap between the actual and peak performance. In this paper, we present the MaPU architecture, a novel architecture which is suitable for data-intensive computing with great power efficiency and sustained computation throughput. To achieve this goal, MaPU attempts to optimize the application from a system perspective, including the hardware, algorithm and corresponding program model. It uses an innovative multi-granularity parallel memory system with intrinsic shuffle ability, cascading pipelines with wide SIMD data paths and a state-machine-based program model. When executing typical signal processing algorithms, a single MaPU core implemented with a 40nm process exhibits a sustained performance of 134 GLOPS while consuming only 2.8 W in power, which increases the actual power efficiency by an order of magnitude comparable with the traditional CPU and GPGPU.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114728555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446102
Andrew J. Herdrich, Edwin Verplanke, Priya Autee, R. Illikkal, C. Gianos, Ronak Singhal, R. Iyer
{"title":"Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family","authors":"Andrew J. Herdrich, Edwin Verplanke, Priya Autee, R. Illikkal, C. Gianos, Ronak Singhal, R. Iyer","doi":"10.1109/HPCA.2016.7446102","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446102","url":null,"abstract":"Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been growing research topic. QoS techniques have been proposed to address the shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation, performance determinism and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis and no commercial platforms were available that integrated such techniques and provided a practical solution. In this paper, we will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We will describe two key technologies: (i) Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and (ii) Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention. This is the first paper to describing these techniques as they moved from concept to reality, starting from early research to product implementation. We will also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters and communications platforms. Finally, we will describe initial software infrastructure and enabling for industry practitioners and researchers to take advantage of these technologies for their QoS needs.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125397101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 135
Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446094
Xun Jian, Vilas Sridharan, Rakesh Kumar
{"title":"Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems","authors":"Xun Jian, Vilas Sridharan, Rakesh Kumar","doi":"10.1109/HPCA.2016.7446094","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446094","url":null,"abstract":"Emerging die-stacked DRAMs provide several factors higher bandwidth and energy efficiency than 2D DRAMs, making them excellent candidates for future memory systems. To be deployed in server and high-performance computing systems, however, die-stacked DRAMs need to provide equivalent or better reliability than existing 2D DRAMs. This includes protecting against channel and die faults, which have been observed in existing 2D DRAM production systems. In this paper, we observe that memory subsystems can be viewed as a multi-dimensional collection of memory banks in which faults generally affect memory banks that lie along a single dimension. For instance, in die-stacked DRAMs, a die consists of a group of DRAM banks that lie in a horizontal plane while a channel consists of a vertical group of banks spanning across multiple dies. We exploit this fault behavior to propose Parity Helix to efficiently protect against single-dimensional faults in multi-dimensional memory systems. Parity Helix shares the same error correction resources across all dimensions to minimize error correction overheads. For die-stacked DRAMs, our evaluation shows that compared to a straightforward extension of previous schemes, Parity Helix increases memory capacity by 16.7%, reduces memory energy per program access by 21%, on average, and by up to 45%.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121499844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Software transparent dynamic binary translation for coarse-grain reconfigurable architectures
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI: 10.1109/HPCA.2016.7446060
Matthew A. Watkins, Tony Nowatzki, Anthony Carno
{"title":"Software transparent dynamic binary translation for coarse-grain reconfigurable architectures","authors":"Matthew A. Watkins, Tony Nowatzki, Anthony Carno","doi":"10.1109/HPCA.2016.7446060","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446060","url":null,"abstract":"The end of Dennard Scaling has forced architects to focus on designing for execution efficiency. Course-grained reconfigurable architectures (CGRAs) are a class of architectures that provide a configurable grouping of functional units that aim to bridge the gap between the power and performance of custom hardware and the flexibility of software. Despite their potential benefit, CGRAs face a major adoption challenge as they do not execute a standard instruction stream. Dynamic translation for CGRAs has the potential to solve this problem, but faces non-trivial challenges. Existing attempts either do not achieve the full power and performance potential CGRAs offer or suffer from excessive translation time. In this work we propose DORA, a Dynamic Optimizer for Reconfigurable Architectures, which achieves substantial (2X) power and performance improvements while having low hardware and insertion overhead and benefiting the current execution. In addition to traditional optimizations, DORA leverages dynamic register information to perform optimizations not available to compilers and achieves performance similar to or better than CGRA-targeted compiled code.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126565126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20