2017 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Papers, Page 4

Secure Dynamic Memory Scheduling Against Timing Channel Attacks

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.27

Yao Wang, Benjamin Wu, G. Suh

Citations: 11

Fast Decentralized Power Capping for Server Clusters

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.49

R. Azimi, Masoud Badiei, Xin Zhan, Na Li, S. Reda

Abstract: Power capping is a mechanism to ensure that the power consumption of clusters does not exceed the provisioned resources. A fast power capping method allows for a safe over-subscription of the rated power distribution devices, provides equipment protection, and enables large clusters to participate in demand-response programs. However, current methods have a slow response time with a large actuation latency when applied across a large number of servers, as they rely on hierarchical management systems. We propose a fast decentralized power capping (DPC) technique that reduces the actuation latency by localizing power management at each server. The DPC method is based on a maximum throughput optimization formulation that takes into account the workload priorities as well as the capacity of circuit breakers. Therefore, DPC significantly improves the cluster performance compared to alternative heuristics. We implement the proposed decentralized power management scheme on a real computing cluster. Compared to state-of-the-art hierarchical methods, DPC reduces the actuation latency by 72% to 86%, depending on the cluster size. In addition, DPC improves the system throughput performance by 16%, while using only 0.02% of the available network bandwidth. We describe how to minimize the overhead of each local DPC agent to a negligible amount. We also quantify the traffic and fault resilience of our decentralized power capping approach.

Citations: 17

BRAVO: Balanced Reliability-Aware Voltage Optimization

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.56

Karthik Swaminathan, Nandhini Chandramoorthy, Chen-Yong Cher, Ramon Bertran Monfort, A. Buyuktosunoglu, P. Bose

Abstract: Defining a processor micro-architecture for a targeted product space involves multi-dimensional optimization across performance, power and reliability axes. A key decision in such a definition process is the circuit- and technology-driven parameter of the nominal (voltage, frequency) operating point. This is a challenging task, since optimizing individually or pair-wise amongst these metrics usually results in a design that falls short of the specification in at least one of the three dimensions. Aided by academic research, industry has now adopted early-stage definition methodologies that consider both energy- and performance-related metrics. Reliability-related enhancements, on the other hand, tend to get factored in via a separate thread of activity. This task is typically pursued without thorough pre-silicon quantifications of the energy or even the performance cost. In the late-CMOS design era, reliability needs to move from a post-silicon afterthought or validation-only effort to a pre-silicon definition process. In this paper, we present BRAVO, a methodology for such reliability-aware design space exploration. BRAVO is supported by a multi-core simulation framework that integrates performance, power and reliability modeling capability. Errors induced by both soft and hard fault incidence are captured within the reliability models. We introduce the notion of the Balanced Reliability Metric (BRM), that we use to evaluate overall reliability of the processor across soft and hard error incidences. We demonstrate up to 79% improvement in reliability in terms of this metric, for only a 6% drop in overall energy efficiency over design points that maximize energy efficiency. We also demonstrate several real-life use-case applications of BRAVO in an industrial setting.

Citations: 25

GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.54

Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, Hyesoon Kim

Abstract: With the emergence of data science, graph computing has become increasingly important. Unfortunately, graph computing typically suffers from poor performance when mapped to modern computing systems because of the overhead of executing atomic operations and inefficient utilization of the memory subsystem. Meanwhile, emerging technologies, such as Hybrid Memory Cube (HMC), enable processing-in-memory (PIM) functionality by offloading operations at the instruction level. Instruction offloading to the PIM side has considerable potential to overcome the performance bottleneck of graph computing. Nevertheless, this functionality for graph workloads has not been fully explored, and its applications and shortcomings have not been well identified thus far. In this paper, we present GraphPIM, a full-stack solution for graph computing that achieves higher performance using PIM functionality. We perform an analysis on modern graph workloads to assess the applicability of PIM offloading and present hardware and software mechanisms to efficiently make use of the PIM functionality. Following the real-world HMC 2.0 specification, GraphPIM provides performance benefits for graph applications without any user code modification or ISA changes. In addition, we propose an extension to PIM operations that can further bring performance benefits for more graph applications. The evaluation results show that GraphPIM achieves up to a 2.4X speedup with a 37% reduction in energy consumption.

Citations: 230

Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.61

Yu Cai, Saugata Ghose, Yixin Luo, K. Mai, O. Mutlu, E. Haratsch

Abstract: Modern NAND flash memory chips provide high density by storing two bits of data in each flash cell, called a multi-level cell (MLC). An MLC partitions the threshold voltage range of a flash cell into four voltage states. When a flash cell is programmed, a high voltage is applied to the cell. Due to parasitic capacitance coupling between flash cells that are physically close to each other, flash cell programming can lead to cell-to-cell program interference, which introduces errors into neighboring flash cells. In order to reduce the impact of cell-to-cell interference on the reliability of MLC NAND flash memory, flash manufacturers adopt a two-step programming method, which programs the MLC in two separate steps. First, the flash memory partially programs the least significant bit of the MLC to some intermediate threshold voltage. Second, it programs the most significant bit to bring the MLC up to its full voltage state. In this paper, we demonstrate that two-step programming exposes new reliability and security vulnerabilities. We experimentally characterize the effects of two-step programming using contemporary 1X-nm (i.e., 15–19nm) flash memory chips. We find that a partially-programmed flash cell (i.e., a cell where the second programming step has not yet been performed) is much more vulnerable to cell-to-cell interference and read disturb than a fully-programmed cell. We show that it is possible to exploit these vulnerabilities on solid-state drives (SSDs) to alter the partially-programmed data, causing (potentially malicious) data corruption. Building on our experimental observations, we propose several new mechanisms for MLC NAND flash memory that eliminate or mitigate data corruption in partially-programmed cells, thereby removing or reducing the extent of the vulnerabilities, and at the same time increasing flash memory lifetime by 16%.
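The two-step programming sequence described in this abstract can be illustrated with a small state model. This is a minimal sketch, not the paper's experimental setup: the state names (ER, TP, P1, P2, P3), the (MSB, LSB) bit assignment, and the function names are assumptions chosen for illustration. The abstract itself only states that the LSB is programmed first to an intermediate threshold voltage and the MSB second.

```python
# Illustrative model of two-step MLC programming (all names are hypothetical).
# Assumption: four final states ordered by threshold voltage, ER < P1 < P2 < P3,
# encoded as (MSB, LSB): ER = 11, P1 = 01, P2 = 00, P3 = 10.
FINAL_STATE = {
    (1, 1): "ER",
    (0, 1): "P1",
    (0, 0): "P2",
    (1, 0): "P3",
}


def program_lsb(lsb: int) -> str:
    """Step 1: LSB = 1 leaves the cell erased; LSB = 0 raises it to a
    temporary intermediate state (called TP here)."""
    return "ER" if lsb == 1 else "TP"


def program_msb(partial_state: str, msb: int) -> str:
    """Step 2: recover the LSB from the cell's current (partial) state,
    then move the cell to its final state for the given MSB.
    Assumption: the LSB is read back from the cell rather than resupplied."""
    lsb = 1 if partial_state == "ER" else 0
    return FINAL_STATE[(msb, lsb)]


# Normal case: intended bits MSB = 0, LSB = 0 end in state P2.
partial = program_lsb(0)
print(program_msb(partial, 0))  # P2

# If a disturbance makes the partially-programmed cell read as ER before
# step 2, the same MSB write lands in P1, so the stored LSB is wrong.
print(program_msb("ER", 0))  # P1
```

The second call shows why the window between the two steps matters in this model: any error in the partial state is carried into the final state by step 2.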

Citations: 117

Supporting Address Translation for Accelerator-Centric Architectures

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.19

Y. Hao, Zhenman Fang, Glenn D. Reinman, J. Cong

Abstract: While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety, which penalizes the overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design to provide low-latency caching of translations to each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design effectively reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to the ideal address translation.
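The interaction between private TLBs and a shared level-two TLB can be sketched as a small functional model. This is only an illustration under stated assumptions: the class names, the LRU replacement policy, the 4 KiB page size, and the entry counts are all hypothetical and are not taken from the paper.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class LRUTLB:
    """A tiny TLB model that tracks only which virtual page numbers are cached."""

    def __init__(self, entries: int):
        self.entries = entries
        self.cached = OrderedDict()

    def lookup(self, vpn: int) -> bool:
        if vpn in self.cached:
            self.cached.move_to_end(vpn)  # mark as most recently used
            return True
        return False

    def insert(self, vpn: int) -> None:
        self.cached[vpn] = True
        self.cached.move_to_end(vpn)
        if len(self.cached) > self.entries:
            self.cached.popitem(last=False)  # evict least recently used


class TwoLevelTLB:
    """One private TLB per accelerator, backed by a single shared TLB."""

    def __init__(self, num_accelerators: int, private_entries: int = 4,
                 shared_entries: int = 64):
        self.private = [LRUTLB(private_entries) for _ in range(num_accelerators)]
        self.shared = LRUTLB(shared_entries)
        self.page_walks = 0

    def translate(self, acc_id: int, vaddr: int) -> str:
        vpn = vaddr // PAGE_SIZE
        if self.private[acc_id].lookup(vpn):
            return "private_hit"
        if self.shared.lookup(vpn):
            self.private[acc_id].insert(vpn)
            return "shared_hit"
        # Miss in both levels: only now is a page walk needed.
        self.page_walks += 1
        self.shared.insert(vpn)
        self.private[acc_id].insert(vpn)
        return "page_walk"


tlb = TwoLevelTLB(num_accelerators=2)
print(tlb.translate(0, 0x1000))  # page_walk
print(tlb.translate(1, 0x1800))  # shared_hit (neighboring tile, same page)
print(tlb.page_walks)            # 1
```

In this model, two accelerators touching neighboring tiles within the same page trigger one page walk instead of two, which is the duplicate-walk elimination the abstract attributes to the shared level-two TLB.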

Citations: 65

Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.10

Salessawi Ferede Yitbarek, Misiker Tadesse Aga, R. Das, T. Austin

Abstract: Previous work has demonstrated that systems with unencrypted DRAM interfaces are susceptible to cold boot attacks, where the DRAM in a system is frozen to give it sufficient retention time and is then re-read after reboot, or is transferred to an attacker's machine for extracting sensitive data. This method has been shown to be an effective attack vector for extracting disk encryption keys out of locked devices. However, most modern systems incorporate some form of data scrambling into their DRAM interfaces, making cold boot attacks challenging. While first added as a measure to improve signal integrity and reduce power supply noise, these scramblers today serve the added purpose of obscuring the DRAM contents. It has previously been shown that scrambled DDR3 systems do not provide meaningful protection against cold boot attacks. In this paper, we investigate the enhancements that have been introduced in DDR4 memory scramblers in the 6th generation Intel Core (Skylake) processors. We then present an attack that demonstrates these enhanced DDR4 scramblers still do not provide sufficient protection against cold boot attacks. We detail a proof-of-concept attack that extracts memory resident AES keys, including disk encryption keys. The limitations of memory scramblers we point out in this paper motivate the need for strong yet low-overhead full-memory encryption schemes. Existing schemes such as Intel's SGX can effectively prevent such attacks, but have overheads that may not be acceptable for performance-sensitive applications. However, it is possible to deploy a memory encryption scheme that has zero performance overhead by forgoing integrity checking and replay attack protections afforded by Intel SGX. To that end, we present analyses that confirm modern stream ciphers such as ChaCha8 are sufficiently fast that it is now possible to completely overlap keystream generation with DRAM row buffer access latency, thereby enabling the creation of strongly encrypted DRAMs with zero exposed latency. Adopting such low-overhead measures in future generations of products can effectively shut down cold boot attacks in systems where the overhead of existing memory encryption schemes is unacceptable. Furthermore, the emergence of non-volatile DIMMs that fit into DDR4 buses is going to exacerbate the risk of cold boot attacks. Hence, strong full memory encryption is going to be even more crucial on such systems.

Citations: 54

MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.39

A. Prodromou, Mitesh R. Meswani, N. Jayasena, G. Loh, D. Tullsen

Abstract: In the near future, die-stacked DRAM will be increasingly present in conjunction with off-chip memories in hybrid memory systems. Research on this subject revolves around using the stacked memory as a cache or as part of a flat address space. This paper proposes MemPod, a scalable and efficient memory management mechanism for flat address space hybrid memories. MemPod monitors memory activity and periodically migrates the most frequently accessed memory pages to the faster on-chip memory. MemPod's partitioned architectural organization allows for efficient scaling with memory system capabilities. Further, a big data analytics algorithm is adapted to develop an efficient, low-cost activity tracking technique. MemPod improves the average main memory access time of multi-programmed workloads by up to 29% (9% on average) compared to the state of the art, and that will increase as the differential between memory speeds widens. MemPod's novel activity tracking approach leads to significant cost reduction (12800x lower storage space requirements) and improved future prediction accuracy over prior work which maintains a separate counter per page.
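To give a feel for how activity tracking with a bounded number of counters can replace one counter per page, here is a minimal sketch. The abstract does not name the analytics algorithm it adapts, so this example uses the Misra-Gries frequent-items summary purely as a stand-in. The function names and parameters are hypothetical.

```python
def track_hot_pages(accesses, num_counters):
    """Misra-Gries summary: keep at most num_counters (page, count) pairs,
    regardless of how many distinct pages appear in the access stream."""
    counters = {}
    for page in accesses:
        if page in counters:
            counters[page] += 1
        elif len(counters) < num_counters:
            counters[page] = 1
        else:
            # No free counter: decrement every tracked page and drop zeros.
            for tracked in list(counters):
                counters[tracked] -= 1
                if counters[tracked] == 0:
                    del counters[tracked]
    return counters


def pick_migration_candidates(accesses, num_counters, max_migrations):
    """At the end of an interval, rank tracked pages by estimated count and
    return the ones that would be migrated to the faster memory."""
    counters = track_hot_pages(accesses, num_counters)
    ranked = sorted(counters, key=lambda page: (-counters[page], page))
    return ranked[:max_migrations]


interval = [7, 7, 3, 7, 9, 7, 3, 5, 7, 3]
print(track_hot_pages(interval, num_counters=2))  # {7: 3, 3: 1}
print(pick_migration_candidates(interval, num_counters=2, max_migrations=1))  # [7]
```

The stored counts are underestimates of the true access counts, but the storage cost depends only on `num_counters`, not on the number of pages in memory. That trade-off is the general idea behind the storage reduction the abstract reports, although the exact mechanism in the paper may differ.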

Citations: 41

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.41

Daniel Oliveira, L. Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Fernandes, Caio B. Lunardi, J. Cela, P. Navaux, L. Carro, P. Rech

Abstract: In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output, correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures.
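The metrics named in this abstract (number of corrupted elements, their spatial locality, and the dataset-wise mean relative error) can be sketched as follows. This is an illustrative interpretation, not the paper's tooling: the function names, the tolerance parameter, and the three spatial categories are assumptions, and the golden (fault-free) output is assumed to contain no zeros.

```python
def relative_errors(golden, observed):
    """Per-element relative error against the fault-free (golden) output.
    Assumption: no golden value is zero."""
    return [abs(o - g) / abs(g) for g, o in zip(golden, observed)]


def corrupted_elements(golden, observed, tolerance=0.0):
    """Indices whose relative error exceeds the tolerance. A non-zero
    tolerance models imprecise computing, where small deviations are accepted."""
    errors = relative_errors(golden, observed)
    return [i for i, err in enumerate(errors) if err > tolerance]


def mean_relative_error(golden, observed):
    """Mean relative error taken over the whole output dataset."""
    errors = relative_errors(golden, observed)
    return sum(errors) / len(errors)


def spatial_pattern(positions):
    """Coarse spatial classification of corrupted (row, col) positions.
    The category names are placeholders for this sketch."""
    if not positions:
        return "none"
    if len(positions) == 1:
        return "single"
    rows = {r for r, _ in positions}
    cols = {c for _, c in positions}
    if len(rows) == 1 or len(cols) == 1:
        return "line"
    return "scattered"


golden = [1.0, 2.0, 4.0, 8.0]
observed = [1.0, 2.5, 4.0, 6.0]
print(corrupted_elements(golden, observed))                 # [1, 3]
print(corrupted_elements(golden, observed, tolerance=0.3))  # []
print(mean_relative_error(golden, observed))                # 0.125
print(spatial_pattern([(0, 1), (0, 3)]))                    # line
```

The second call illustrates the abstract's point about mismatch detection: with a 30% tolerance the same run counts as error-free, so a plain mismatch count and a magnitude-aware metric can rank the same device differently.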

Citations: 31

Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2017-02-01 DOI: 10.1109/HPCA.2017.59

Snehasish Kumar, Nick Sumner, V. Srinivasan, Steve Margerm, Arrvindh Shriraman

Abstract: Technology constraints have increasingly led to the adoption of specialized coprocessors, i.e., hardware accelerators. The first challenge that computer architects encounter is identifying "what to specialize in the program". We demonstrate that this requires precise enumeration of program paths based on dynamic program behavior. We hypothesize that path-based [4] accelerator offloading leads to good coverage of dynamic instructions and improves energy efficiency. Unfortunately, hot paths across programs demonstrate diverse control flow behavior. Accelerators (typically based on dataflow execution) often lack energy-efficient, complexity-effective, and high-performance (e.g., branch prediction) support for control flow. We have developed NEEDLE, an LLVM-based compiler framework that leverages dynamic profile information to identify, merge, and offload acceleratable paths from whole applications. NEEDLE derives insight into what code coverage (and consequently energy reduction) an accelerator can achieve. We also develop a novel program abstraction for offload called Braid, that merges common code regions across different paths to improve coverage of the accelerator while trading off the increase in dataflow size. This enables coarse-grained offloading, reducing interaction with the host CPU core. To prepare the Braids and paths for acceleration, NEEDLE generates software frames. Software frames enable energy-efficient speculative execution on accelerators. They are independent of the accelerator microarchitecture and support speculative execution, including memory operations. NEEDLE is automated and has been used to analyze 225K paths across 29 workloads. It filtered and ranked 154K paths for acceleration across unmodified SPEC, PARSEC and PERFECT workload suites. We target NEEDLE's offload regions toward a CGRA and demonstrate 34% performance and 20% energy improvement.

Citations: 16