2018 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Papers

Searching for Potential gRNA Off-Target Sites for CRISPR/Cas9 Using Automata Processing Across Different Platforms

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00068

Chunkun Bo, V. Dang, Elaheh Sadredini, K. Skadron

Abstract: The CRISPR/Cas system is a bacterial immune system that protects cells from foreign genetic elements. One version that has attracted special interest is CRISPR/Cas9, because it can be modified to edit genomes at targeted locations. However, the risk of binding and damaging off-target locations limits its power. Identifying all these potential off-target sites is thus important for users to effectively use the system to edit genomes. This process is computationally expensive, especially when one allows more differences in gRNA targeting sequences. In this paper, we propose using automata to search for off-target sites while allowing differences between the reference genome and gRNA targeting sequences. We evaluate the automata-based approach on four different platforms, including conventional architectures such as the CPU and the GPU, and spatial architectures such as the FPGA and Micron's Automata Processor (AP). We compare the proposed approach with two off-target search tools (CasOFFinder (GPU) and CasOT (CPU)), and achieve over 83x speedups on the FPGA compared with CasOFFinder and over 600x speedups compared with CasOT. More customized hardware such as the AP can provide additional speedups (1.5x for the kernel execution) compared with the FPGA. We also evaluate the automata-based solution using single-thread HyperScan (a high-performance automata processing library) on the CPU. HyperScan outperforms CasOT by over 29.7x. The automata-based approach on iNFAnt2 (a DFA/NFA engine on the GPU) does not consistently work better than CasOFFinder, and only shows a slightly better speedup compared with single-thread HyperScan on the CPU (4.4x for the best case). These results show that the automata-based approach provides significant algorithmic benefits, and that accelerators such as the FPGA and the AP can provide substantial additional speedups. However, iNFAnt2 does not confer a clear advantage because the proposed method does not map well to the GPU architecture. Furthermore, we propose several methods to further improve the performance on spatial architectures, and some potential architectural modifications for future automata processing hardware.
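To make the idea of "searching with automata while allowing differences" concrete, here is a minimal Python sketch of a Hamming-distance NFA simulated in software. It is an illustration only, not the authors' implementation: the function name `off_target_hits`, the state encoding `(i, k)`, and the toy sequences are all assumptions, and the sketch ignores PAM handling, insertions and deletions, and the reverse strand.

```python
def off_target_hits(genome, grna, max_mismatches):
    """Simulate a mismatch-tolerant NFA over the genome, one base per step.

    Hypothetical helper for illustration. Each active state (i, k) means:
    the first i bases of grna have been consumed with k mismatches so far.
    Returns a list of (start_position, mismatches) for every reported site.
    """
    n = len(grna)
    if n == 0:
        return []
    active = set()
    hits = []
    for pos, base in enumerate(genome):
        # A new candidate site may begin at every genome position.
        active.add((0, 0))
        next_active = set()
        for i, k in active:
            k2 = k + (base != grna[i])  # a mismatch costs one unit
            if k2 > max_mismatches:
                continue  # this candidate dies: too many differences
            if i + 1 == n:
                hits.append((pos - n + 1, k2))  # reporting state reached
            else:
                next_active.add((i + 1, k2))
        active = next_active
    return hits


# Toy example: the pattern occurs exactly at 0 and 8, and with one mismatch at 4.
print(off_target_hits("ACGTACGAACGT", "ACGT", 1))  # [(0, 0), (4, 1), (8, 0)]
```

All active states advance together on each input symbol, which is the property that spatial architectures can exploit by updating states in parallel; a CPU loop like this one visits them one at a time.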

Citations: 24

Reducing Data Transfer Energy by Exploiting Similarity within a Data Transaction

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00014

Donghyuk Lee, Mike O'Connor, Niladrish Chatterjee

Abstract: Modern highly parallel GPU systems require high-bandwidth DRAM I/O interfaces that can consume a significant amount of energy. This energy increases in proportion to the number of 1 values in the data transactions due to the asymmetric energy consumption of the Pseudo Open Drain (POD) I/O interface in contemporary Graphics DDR SDRAMs. In this work, we describe a technique to save energy by reducing the energy-expensive 1 values in the DRAM interface. We observe that multiple data elements within a single cache line/sector are often similar to one another. We exploit this characteristic to encode each transfer to the DRAM such that there is one reference copy of the data, with remaining similar data items being encoded predominantly as 0 values. Our proposed low energy data transfer mechanism, Base+XOR Transfer, encodes the data-similar portion by performing XOR operations between data elements within a single DRAM transaction. We address two challenges that influence the efficiency of our mechanism: i) the frequent appearance of zero data elements in transactions, and ii) the diversity in the underlying size of data types within a transaction. We describe two techniques, Zero Data Remapping and Universal Base+XOR Transfer, to efficiently address these issues. Our proposed encoding scheme requires no additional metadata or changes to existing DRAM devices. We evaluate our mechanism on a modern high performance GPU system with a variety of graphics and compute workloads. We show that our mechanism reduces energy-expensive 1 values by 35.3% with minimal overheads, and combining our mechanism with Dynamic Bus Inversion (DBI) reduces 1 values by 48.2% on average. These 1 value reductions lead to 5.8% and 7.1% DRAM energy savings, respectively.
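The core of the Base+XOR idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's exact scheme: it assumes the first element of the transaction serves as the base, that every other element is XORed against that base, and that elements are fixed-width integers. The helper names (`base_xor_encode`, `base_xor_decode`, `count_ones`) are hypothetical, and Zero Data Remapping and the Universal variant are not modeled.

```python
def count_ones(words):
    """Total number of 1 bits across all words (the energy-expensive values)."""
    return sum(bin(w).count("1") for w in words)


def base_xor_encode(words):
    """Keep words[0] as the base; send every other word as (word XOR base).

    Assumption for illustration: the base is the first element. Similar
    words differ from the base in few bits, so their XOR is mostly 0s.
    """
    if not words:
        return []
    base = words[0]
    return [base] + [w ^ base for w in words[1:]]


def base_xor_decode(encoded):
    """Invert the encoding: XOR with the base again restores each word."""
    if not encoded:
        return []
    base = encoded[0]
    return [base] + [w ^ base for w in encoded[1:]]


# Four similar 32-bit values, as might appear in one cache sector.
line = [0x12345678, 0x12345679, 0x1234567A, 0x12345670]
encoded = base_xor_encode(line)
print(count_ones(line), count_ones(encoded))  # 53 16
```

Because XOR is its own inverse, decoding needs only the base, which is consistent with the abstract's statement that no additional metadata is required. The sketch also shows why zero elements are a problem: XORing a zero element against a nonzero base reproduces the base's 1 bits, which is the case the abstract's Zero Data Remapping is meant to address.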

Citations: 21

Don’t Correct the Tags in a Cache, Just Check Their Hamming Distance from the Lookup Tag

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00055

Alex Gendler, A. Bramnik, Ariel Szapiro, Yiannakis Sazeides

Citations: 5

The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00026

Jeremie S. Kim, Minesh Patel, Hasan Hassan, O. Mutlu

Abstract: Physically Unclonable Functions (PUFs) are commonly used in cryptography to identify devices based on the uniqueness of their physical microstructures. DRAM-based PUFs have numerous advantages over PUF designs that exploit alternative substrates: DRAM is a major component of many modern systems, and a DRAM-based PUF can generate many unique identifiers. However, none of the prior DRAM PUF proposals provide implementations suitable for runtime-accessible PUF evaluation on commodity DRAM devices. Prior DRAM PUFs exhibit unacceptably high latencies, especially at low temperatures (e.g., >125.8s on average for a 64KiB memory segment below 55°C), and they cause high system interference by keeping part of DRAM unavailable during PUF evaluation. In this paper, we introduce the DRAM latency PUF, a new class of fast, reliable DRAM PUFs. The key idea is to reduce DRAM read access latency below the reliable datasheet specifications using software-only system calls. Doing so results in error patterns that reflect the compound effects of manufacturing variations in various DRAM structures (e.g., capacitors, wires, sense amplifiers). Based on a rigorous experimental characterization of 223 modern LPDDR4 DRAM chips, we demonstrate that these error patterns 1) satisfy runtime-accessible PUF requirements, and 2) are quickly generated (i.e., at 88.2ms) irrespective of operating temperature using a real system with no additional hardware modifications. We show that, for a constant DRAM capacity overhead of 64KiB, our implementation of the DRAM latency PUF enables an average (minimum, maximum) PUF evaluation time speedup of 152x (109x, 181x) at 70°C and 1426x (868x, 1783x) at 55°C when compared to a DRAM retention PUF and achieves greater speedups at even lower temperatures.

Citations: 115

Comprehensive VM Protection Against Untrusted Hypervisor Through Retrofitted AMD Memory Encryption

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00045

Yuming Wu, Yutao Liu, Ruifeng Liu, Haibo Chen, B. Zang, Haibing Guan

Abstract: The confidentiality of tenants' data is at high risk when facing hardware attacks and privileged malicious software. Hardware-based memory encryption is one of the promising means to provide strong guarantees of data security. Recently AMD has proposed its new memory encryption hardware called SME and SEV, which can selectively encrypt memory regions in a fine-grained manner, e.g., by setting the C-bits in the page table entries. More importantly, SEV further supports encrypted virtual machines. This, intuitively, has provided a new opportunity to protect data confidentiality in guest VMs against an untrusted hypervisor in the cloud environment. In this paper, we first provide a security analysis on the (in)security of SEV and uncover a set of security issues of using SEV as a means to defend against an untrusted hypervisor. Based on the study, we then propose a software-based extension to the SEV feature, namely Fidelius, to address those issues while retaining performance efficiency. Fidelius separates the management of critical resources from service provisioning and revokes the permissions of accessing specific resources from the untrusted hypervisor. By adopting a sibling-based protection mechanism with non-bypassable memory isolation, Fidelius embraces both security and efficiency, as it introduces no new layer of abstraction. Meanwhile, Fidelius reuses the SEV API to provide a full VM life-cycle protection, including two sets of para-virtualized I/O interfaces to encode the I/O data, which is not considered in the SEV hardware design. A detailed and quantitative security analysis shows its effectiveness in protecting tenants' data from a variety of attack surfaces, and the performance evaluation confirms the performance efficiency of Fidelius.

Citations: 20

Are Coherence Protocol States Vulnerable to Information Leakage?

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00024

Fan Yao, M. Doroslovački, Guru Venkataramani

Abstract: Most commercial multi-core processors incorporate hardware coherence protocols to support efficient data transfers and updates between their constituent cores. While hardware coherence protocols provide immense benefits for application performance by removing the burden of software-based coherence, we note that understanding the security vulnerabilities posed by such oft-used, widely-adopted processor features is critical for secure processor designs in the future. In this paper, we demonstrate a new vulnerability exposed by cache coherence protocol states. We present novel insights into how adversaries could cleverly manipulate the coherence states on shared cache blocks, and construct covert timing channels to illegitimately communicate secrets to the spy. We demonstrate 6 different practical scenarios for covert timing channel construction. In contrast to prior works, we assume a broader adversary model where the trojan and spy can either exploit explicitly shared read-only physical pages (e.g., shared library code), or use the memory deduplication feature to implicitly force the creation of shared physical pages. We demonstrate how adversaries can manipulate combinations of coherence states and data placement in different caches to construct timing channels. We also explore how adversaries could exploit multiple caches and their associated coherence states to improve transmission bandwidth with symbols encoding multiple bits. Our experimental results on commercial systems show that the peak transmission bandwidths of these covert timing channels range from 700 to 1100 Kbits/sec. To the best of our knowledge, our study is the first to highlight the vulnerability of hardware cache coherence protocols to timing channels, which can help computer architects craft effective defenses against exploits on such critical processor features.

Citations: 92

Memory Hierarchy for Web Search

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00061

Grant Ayers, Jung Ho Ahn, C. Kozyrakis, Parthasarathy Ranganathan

Abstract: Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature have precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling, all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.

Citations: 69

High-Performance GPU Transactional Memory via Eager Conflict Detection

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00029

X. Ren, Mieszko Lis

Abstract: GPU transactional memory (TM) proposals to date have relied on lazy, value-based conflict detection, assuming that GPUs can amortize the latency by executing other warps. In practice, however, concurrency must be throttled to a few warps per core to avoid high abort rates, and TM performance has remained far below that of fine-grained locks. We trace this to the latency cost of validating transactions: two round trips across the crossbar required for most commits and aborts. With limited concurrency, the warp scheduler cannot amortize this, and leaves the core idle most of the time. In this paper, we show that value-based validation does not scale to high thread counts, and eager conflict detection becomes more efficient as the number of threads grows. We leverage this insight to propose GETM, a GPU TM with eager conflict detection. GETM relies on a novel distributed logical clock scheme to implement eager conflict detection without the need for cache coherence or signature broadcasts. GETM is up to 2.1 times faster than the state-of-the-art prior work, WarpTM (gmean 1.2 times), with 3.6 times lower silicon area overheads and 2.2 times lower power overheads.

Citations: 6

Characterizing and Mitigating Output Reporting Bottlenecks in Spatial Automata Processing Architectures

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00069

J. Wadden, K. Angstadt, K. Skadron

Abstract: Automata processing has seen a resurgence in importance due to its usefulness for pattern matching and pattern mining of "big data." While large-scale automata processing is known to bottleneck von Neumann processors due to unpredictable memory accesses, spatial architectures excel at automata processing. Spatial architectures can implement automata graphs by wiring together automata states in reconfigurable arrays, allowing parallel automata state computation, and point-to-point state transitions on-chip. However, spatial automata processing architectures can suffer from output constraints (up to 255x in commercial systems!) due to the physical placement of states, output processing architecture design, I/O resources, and the massively parallel nature of the architecture. To understand this bottleneck, we conduct the first known characterization of output requirements of a realistic set of automata processing benchmarks. We find that most benchmarks report fairly frequently, but that few states report at any one time. This observation motivates new output compression schemes and reporting architectures. We evaluate the benefit of one purely software automata transformation and show that output reporting costs can be greatly reduced (improving performance by up to 40%) without hardware modification. We then explore bottlenecks in the reporting architecture of a commercial spatial automata processor and propose a new architecture that improves performance by up to 5.1x.

Citations: 20

GPGPU Power Modeling for Multi-domain Voltage-Frequency Scaling

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI: 10.1109/HPCA.2018.00072

J. Guerreiro, A. Ilic, N. Roma, P. Tomás

Abstract: Dynamic Voltage and Frequency Scaling (DVFS) on Graphics Processing Unit (GPU) components is one of the most promising power management strategies, due to its potential for significant power and energy savings. However, there is still a lack of simple and reliable models for the estimation of the GPU power consumption under a set of different voltage and frequency levels. Accordingly, a novel GPU power estimation model with both core and memory frequency scaling is herein proposed. This model combines information from both the GPU architecture and the executing GPU application and also takes into account the non-linear changes in the GPU voltage when the core and memory frequencies are scaled. The model parameters are estimated using a collection of 83 microbenchmarks carefully crafted to stress the main GPU components. Based on the hardware performance events gathered during the execution of GPU applications on a single frequency configuration, the proposed model makes it possible to predict the power consumption of the application over a wide range of frequency configurations, as well as to decompose the contribution of different parts of the GPU pipeline to the overall power consumption. Validated on 3 GPU devices from the most recent NVIDIA microarchitectures (Pascal, Maxwell and Kepler), by using a collection of 26 standard benchmarks, the proposed model is able to achieve accurate results (7%, 6% and 12% mean absolute error) for the target GPUs (Titan Xp, GTX Titan X and Tesla K40c).

Citations: 37