2020 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00042
Authors: M. Alipour, S. Kaxiras, D. Black-Schaffer, Rakesh Kumar
Abstract: Flexible instruction scheduling is essential for performance in out-of-order processors. This is typically achieved by using CAM-based Instruction Queues (IQs) that provide complete flexibility in choosing ready instructions for execution, but at the cost of significant scheduling energy. In this work we seek to reduce the instruction scheduling energy by reducing the depth and width of the IQ. We do so by classifying instructions based on their readiness and criticality, and using this information to bypass the IQ for instructions that will not benefit from its expensive scheduling structures and to delay instructions that will not harm performance. Combined, these approaches allow us to offload a significant portion of the instructions from the IQ to much cheaper FIFO-based scheduling structures without hurting performance. As a result we can reduce the IQ depth and width by half, thereby saving energy. Our design, Delay and Bypass (DNB), is the first design to explicitly address both readiness and criticality to reduce scheduling energy. By handling both classes we are able to achieve 95% of the baseline out-of-order performance while using only 33% of the scheduling energy. This represents a significant improvement over previous designs, which addressed only criticality or readiness (91%/89% performance at 74%/53% energy).
Citations: 13
ResiRCA: A Resilient Energy Harvesting ReRAM Crossbar-Based Accelerator for Intelligent Embedded Processors
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00034
Authors: Keni Qiu, N. Jao, Mengying Zhao, Cyan Subhra Mishra, Gulsum Gudukbay, Sethu Jose, J. Sampson, M. Kandemir, N. Vijaykrishnan
Abstract: Many recent works have shown substantial efficiency boosts from performing inference tasks on Internet of Things (IoT) nodes rather than merely transmitting raw sensor data. However, such tasks, e.g., convolutional neural networks (CNNs), are very compute intensive. They are therefore challenging to complete at sensing-matched latencies in ultra-low-power and energy-harvesting IoT nodes. ReRAM crossbar-based accelerators (RCAs) are an ideal candidate to perform the dominant multiplication-and-accumulation (MAC) operations in CNNs efficiently, but conventional, performance-oriented RCAs, while energy-efficient, are power hungry and ill-optimized for the intermittent and unstable power supply of energy-harvesting IoT nodes. This paper presents the ResiRCA architecture, which integrates a new, lightweight, and configurable RCA suitable for energy-harvesting environments as an opportunistically executing augmentation to a baseline sense-and-transmit battery-powered IoT node. To maximize ResiRCA throughput under different power levels, we develop the ResiSchedule approach for dynamic RCA reconfiguration. The proposed approach uses loop-tiling-based computation decomposition, model duplication within the RCA, and inter-layer pipelining to reduce RCA activation thresholds and more closely track execution costs with dynamic power income. Experimental results show that ResiRCA together with ResiSchedule achieves average speedups and energy efficiency improvements of 8x and 14x, respectively, compared to a baseline RCA with intermittency-unaware scheduling.
Citations: 18
A Deep Reinforcement Learning Framework for Architectural Exploration: A Routerless NoC Case Study
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00018
Authors: Ting-Ru Lin, Drew Penney, M. Pedram, Lizhong Chen
Abstract: Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement learning framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches, which are either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with Monte Carlo tree search. Experimental results show that, compared with a conventional mesh, the proposed DRL routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, 1.14x reduction in average hop count, and 6.3% lower power consumption.
Citations: 24
DWT: Decoupled Workload Tracing for Data Centers
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00061
Authors: Jian Chen, Ying Zhang, Xiaowei Jiang, Li Zhao, Zheng Cao, Qiang Liu
Abstract: Workload tracing is the foundational technology that many applications hinge upon. However, the recent paradigm shift toward cloud computing has caused tremendous challenges to traditional workload tracing. Existing solutions either require a dedicated offline cluster or fail to capture full-spectrum workload characteristics. This paper proposes DWT, a novel framework that leverages fast online instruction tracing and uses synthetic data offline for memory access pattern reconstruction, thereby capturing the full workload characteristics while obviating the need for dedicated clusters. Experimental results show that the stack distance profiles generated from synthetic address traces match well with the original ones across all SPEC CPU 2017 programs and representative cloud applications, with a correlation coefficient R^2 of no less than 0.9. The page-level access frequencies also match well with those of the original programs. This decoupled tracing approach not only removes the roadblocks to workload characterization for data centers, but also enables new applications such as efficient online resource management.
Citations: 2
Multi-Range Supported Oblivious RAM for Efficient Block Data Retrieval
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00038
Authors: Yuezhi Che, Rujia Wang
Abstract: Data locality exists everywhere in the memory hierarchy. Most applications show temporal and spatial locality, and computer system and architecture designers exploit this property to improve system performance through better data layout, prefetching, and scheduling. The locality property can be represented by memory access patterns, which record the time and frequency of accessed addresses. From the security perspective, if an attacker can trace the access pattern, sensitive information inside the application could be observed and leaked. Oblivious RAM (ORAM) is one of the most effective solutions for mitigating access pattern leakage, as it adds redundant data blocks in space and time. With ORAM protection, intrinsic data locality is broken by the randomly stored data, so the application cannot gain any performance benefit from locality once the ORAM protocol is used. In this work, we study the potential to support multi-range accesses with a new storage- and access-efficient ORAM construction. Our proposed design includes two major schemes: Lite-rORAM, which minimizes the storage overhead of the existing rORAM, and Hybrid-rORAM, which supports multi-range accesses with minimal storage overhead. We preserve locality for consecutive data blocks of different ranges in the application while still obfuscating the access pattern. We tested our proposed schemes with different workloads on local and remote backends. The experimental results show that, in the best case, our proposed ORAM construction reduces data block retrieval time to 0.24x of the baseline Path ORAM, with an 87.5% storage overhead reduction compared to rORAM.
Citations: 13
CASINO Core Microarchitecture: Generating Out-of-Order Schedules Using Cascaded In-Order Scheduling Windows
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00039
Authors: Ipoom Jeong, Seihoon Park, Changmin Lee, W. Ro
Abstract: The performance gap between in-order (InO) and out-of-order (OoO) cores comes from the ability to dynamically create highly optimized instruction issue schedules. In this work, we observe that a significant amount of the performance benefit of OoO scheduling can also be attained by supplementing a traditional InO core with a small, speculative instruction scheduling window, namely SpecInO. SpecInO monitors a small set of instructions ahead of a conventional InO scheduling window, aiming to issue ready instructions behind long-latency stalls. Simulation results show that SpecInO captures and issues 62% of dynamic instructions out of program order. To this end, we propose the CASINO core microarchitecture, which dynamically and speculatively generates OoO schedules with near-InO complexity using CAScaded IN-Order scheduling windows. A Speculative IQ (S-IQ) issues an instruction if it is ready, or otherwise passes it to the next IQ. At the last IQ, instructions are scheduled in program order along serial dependence chains. The net effect is OoO scheduling via collaboration between cascaded InO IQs. To support speculative execution with minimal cost overhead, we propose a novel register renaming technique that allocates free physical registers only to instructions issued from the S-IQ. The proposed core performs dynamic memory disambiguation via an on-commit value check by extending the store buffer already present in an InO core. We further optimize energy efficiency by filtering out redundant associative searches performed by speculated loads. In our analysis, the CASINO core improves performance by 51% over an InO core (within 10 percentage points of an OoO core), which results in 25% and 42% improvements in energy efficiency over InO and OoO cores, respectively.
Citations: 9
QuickNN: Memory and Performance Optimization of k-d Tree Based Nearest Neighbor Search for 3D Point Clouds
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00024
Authors: Reid Pinkham, Shuqing Zeng, Zhengya Zhang
Abstract: The use of Light Detection And Ranging (LiDAR) has enabled continued improvement in the accuracy and performance of autonomous navigation. The latest applications require LiDARs of the highest spatial resolution, which generate a massive amount of 3D point cloud data that must be processed in real time. In this work, we investigate the architecture design for k-Nearest Neighbor (kNN) search, an important processing kernel for 3D point clouds. An approximate kNN search based on a k-dimensional (k-d) tree is employed to improve performance. However, even for today's moderate-sized problems, this approximate kNN search is severely hindered by memory bandwidth due to numerous random accesses and minimal data reuse opportunities. We apply several memory optimization schemes to alleviate the bandwidth bottleneck: 1) the k-d tree data structure is partitioned into two sets, tree nodes and point buckets, based on their distinct characteristics: tree nodes, which have high reuse, are cached for their lifetime to facilitate search, while point buckets, which have low reuse, are organized in regular contiguous segments in external memory to facilitate efficient burst access; 2) write and read caches are added to gather random accesses and transform them into sequential accesses; and 3) tree construction and tree search are interleaved to cut redundant access streams. With optimized memory bandwidth, the kNN search can be further accelerated by two new processing schemes: 1) parallel tree traversal that utilizes multiple workers with minimal tree duplication overhead, and 2) incremental tree building that minimizes the overhead of tree construction by dynamically updating the tree instead of building it from scratch every time. We demonstrate the performance- and memory-optimized QuickNN architecture on FPGA and perform exhaustive benchmarking, showing up to 19x and 7.3x speedups over k-d tree searches performed on a modern CPU and GPU, respectively, and a 14.5x speedup over a comparably sized architecture performing an exact search. Finally, we show that QuickNN achieves a two-orders-of-magnitude increase in performance per watt over CPU and GPU methods.
Citations: 23
AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00036
Authors: Linghao Song, Fan Chen, Youwei Zhuo, Xuehai Qian, H. Li, Yiran Chen
Abstract: Deep neural network (DNN) accelerators, as an example of domain-specific architecture, have demonstrated great success in DNN inference. However, architecture acceleration for the equally important DNN training has not yet been fully studied. With data forward, error backward, and gradient calculation passes, DNN training is a more complicated process with higher computation and communication intensity. Because recent research demonstrates a diminishing specialization return, namely the "accelerator wall", we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present AccPar, a principled and systematic method for determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, AccPar considers the complete tensor partition space and can reveal previously unknown parallelism configurations. AccPar optimizes performance based on a cost model that takes into account both the computation and communication costs of a heterogeneous execution environment. Hence, our method avoids the drawbacks of existing approaches that use communication as a proxy for performance. The enhanced flexibility of tensor partitioning in AccPar allows flexible ratios of computation to be distributed among accelerators with different performance. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate AccPar on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet, the VGG series, and the ResNet series. The average performance improvements of the state-of-the-art "one weird trick" (OWT), HYPAR, and AccPar, normalized to a baseline data-parallelism scheme where each accelerator replicates the model and processes different input data in parallel, are 2.98x, 3.78x, and 6.30x, respectively.
Citations: 33
SpArch: Efficient Architecture for Sparse Matrix Multiplication
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00030
Authors: Zhekai Zhang, Hanrui Wang, Song Han, W. Dally
Abstract: Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer-product-based approach suffers from poor output locality due to numerous partial product matrices. Inefficient reuse of either input or output data leads to extensive and expensive DRAM accesses. To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stages of partial matrices so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces DRAM access by another 1.8x. We also resolve the increased input matrix reads induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing DRAM access by 1.5x. Evaluated on 20 benchmarks, SpArch reduces total DRAM access by 2.8x over the previous state of the art. On average, SpArch achieves 4x, 19x, 18x, 17x, and 1285x speedups and 6x, 164x, 435x, 307x, and 62x energy savings over OuterSpace, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
Citations: 158
Precise Runahead Execution
Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00040
Authors: Ajeya Naithani, Josué Feliu, Almutaz Adileh, L. Eeckhout
Abstract: Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches. A recent improvement tracks the chain of instructions that leads to the long-latency load, stores it in a runahead buffer, and executes only this chain during runahead execution, with the purpose of generating more prefetch requests. Unfortunately, all prior runahead proposals have shortcomings that limit performance and energy efficiency, because they release processor state when entering runahead mode and then need to refill the pipeline to restart normal operation. Moreover, the runahead buffer limits prefetch coverage by tracking only a single chain of instructions that leads to the same long-latency load. We propose precise runahead execution (PRE), which builds on the key observation that, when entering runahead mode, the processor has enough issue queue and physical register file resources to speculatively execute instructions. This mitigates the need to release and refill processor state in the ROB, issue queue, and physical register file. In addition, PRE pre-executes only those instructions in runahead mode that lead to full-window stalls, using a novel register renaming mechanism to quickly free physical registers in runahead mode, further improving efficiency and effectiveness. Finally, PRE optionally buffers decoded runahead micro-ops in the frontend to save energy. Our experimental evaluation on a set of memory-intensive applications shows that PRE achieves an additional 18.2% performance improvement over the recent runahead proposals while reducing energy consumption by 6.8%.
Citations: 13