2020 IEEE International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00035
Tae Jun Ham, Sungjun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, D. Jeong
With the increasing computational demands of neural networks, many hardware accelerators for neural networks have been proposed. Such existing accelerators often focus on popular network types such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs); however, not much attention has been paid to attention mechanisms, an emerging neural network primitive that enables networks to retrieve the most relevant information from a knowledge base, external memory, or past states. The attention mechanism is widely adopted by many state-of-the-art neural networks for computer vision, natural language processing, and machine translation, and accounts for a large portion of total execution time. We observe that today's practice of implementing this mechanism using matrix-vector multiplication is suboptimal, as the attention mechanism is semantically a content-based search where a large portion of computations ends up not being used. Based on this observation, we design and architect A^3, which accelerates attention mechanisms in neural networks with algorithmic approximation and hardware specialization. Our proposed accelerator achieves multiple orders of magnitude improvement in energy efficiency (performance/watt) as well as substantial speedup over state-of-the-art conventional hardware.
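The content-based-search view above can be made concrete with a small sketch: score every key, but run the softmax only over the top-k matches and skip the remaining values. This is a hedged illustration of the general idea, not A^3's exact approximation algorithm or hardware pipeline.

```python
import numpy as np

def approx_attention(q, K, V, k=4):
    """Attention as content-based search: keep only the k best-matching
    keys and renormalize, skipping the computations that would otherwise
    go unused. Illustrative sketch, not the paper's algorithm."""
    scores = K @ q                            # relevance of each key to the query
    top = np.argpartition(scores, -k)[-k:]    # indices of the k best matches
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                              # softmax over the survivors only
    return w @ V[top]                         # weighted sum of matching values

# Toy knowledge base: one-hot keys make the "search" semantics explicit.
K = np.eye(8)
V = np.arange(32.0).reshape(8, 4)
out = approx_attention(20.0 * K[3], K, V, k=4)   # query strongly matching key 3
```

With a query that overwhelmingly matches key 3, the approximate result collapses onto `V[3]`, even though only 4 of the 8 value rows were touched.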
Citations: 100
Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00014
Jingwen Leng, A. Buyuktosunoglu, Ramon Bertran Monfort, P. Bose, Quan Chen, M. Guo, V. Reddi
Accelerators make it hard to build systems that are resilient against transient errors like voltage noise and soft errors. Architects integrate accelerators into the system as black-box third-party IP components, so a fault in one or more accelerators may threaten the system's reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Existing solutions that assure system reliability come at the cost of sacrificing accelerator generality and efficiency, and incur significant overhead even in the absence of errors. To overcome these drawbacks, we examine reliability management of accelerator systems via hardware-software co-design, coupling an efficient architecture design with compiler and runtime support to cope with transient errors. We introduce asymmetric resilience, which architects reliability at the system level, centered around a hardened CPU, rather than at the accelerator level. At runtime, the system exploits task-level idempotency to contain accelerator errors and uses memory protection instead of checkpoints to mitigate overheads. We also leverage the fact that errors rarely occur in systems, and exploit the trade-off between error-recovery performance and improved error-free performance to enhance system efficiency. Using GPUs, which are at the forefront of accelerator systems, we demonstrate how our system architecture manages reliability in both integrated and discrete systems, under voltage-noise and soft-error related faults, leading to extremely low overhead (less than 1%) and substantial gains (20% energy savings on average).
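The recovery idea — an idempotent task whose inputs stay read-only can simply be re-executed after a transient fault, with no checkpoint — can be sketched as follows. The runtime semantics here are assumptions for illustration, not the paper's exact mechanism.

```python
class TransientError(Exception):
    """Detected-but-recoverable fault signaled by the accelerator."""

def run_task_resilient(task, inputs, max_retries=3):
    """Task-level idempotent recovery sketch: because the task never
    mutates its inputs, a transient fault is contained simply by
    re-running the whole task -- no checkpoints are ever taken."""
    for _ in range(max_retries + 1):
        try:
            return task(inputs)          # offload to the "accelerator"
        except TransientError:
            continue                     # idempotency makes a re-run safe
    raise RuntimeError("fault persists; fall back to the hardened CPU")

calls = {"n": 0}
def flaky_square(xs):
    calls["n"] += 1
    if calls["n"] == 1:                  # first run is hit by a soft error
        raise TransientError("bit flip detected")
    return [x * x for x in xs]

result = run_task_resilient(flaky_square, [1, 2, 3])
```

The flaky task fails once and succeeds on re-execution; in the error-free common case the retry loop costs nothing, which mirrors the paper's emphasis on near-zero overhead when no fault occurs.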
Citations: 15
ELP2IM: Efficient and Low Power Bitwise Operation Processing in DRAM
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00033
Xin Xin, Youtao Zhang, Jun Yang
Recently proposed DRAM-based memory-centric architectures have demonstrated great potential in addressing the memory-wall challenge of modern computing systems. Such architectures exploit charge sharing of multiple rows to enable in-memory bitwise operations. However, existing designs rely heavily on reserved rows to implement computation, which introduces high data-movement overhead, long operation latency, large energy consumption, and low operation reliability. In this paper, we propose ELP2IM, an efficient and low-power processing-in-memory architecture, to address the above issues. ELP2IM utilizes two stable states of sense amplifiers in DRAM subarrays so that it can effectively reduce the number of intra-subarray data movements as well as the number of concurrently opened DRAM rows, which exhibits great performance and energy-consumption advantages over existing designs. Our experimental results show that the power efficiency of ELP2IM is more than a 2X improvement over state-of-the-art DRAM-based memory-centric designs in real applications.
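The charge-sharing primitive that the abstract says existing designs rely on amounts to a bitwise majority across simultaneously activated rows, with AND and OR obtained by fixing one operand row. A toy bit-vector model of that background primitive (this illustrates the prior charge-sharing designs, not ELP2IM's own sense-amplifier-state mechanism):

```python
def maj3(a, b, c):
    """Bitwise majority of three equal-length bit rows -- the logical
    effect of charge sharing across three simultaneously opened rows."""
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]

def bit_and(a, b):
    return maj3(a, b, [0] * len(a))   # majority with an all-zeros control row

def bit_or(a, b):
    return maj3(a, b, [1] * len(a))   # majority with an all-ones control row
```

The control rows are exactly the "reserved rows" whose copy-in/copy-out traffic ELP2IM is designed to avoid.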
Citations: 51
PIXEL: Photonic Neural Network Accelerator
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00046
Kyle Shiflett, Dylan Wright, Avinash Karanth, A. Louri
Machine learning (ML) architectures such as deep neural networks (DNNs) have achieved unprecedented accuracy on modern applications such as image classification and speech recognition. With power dissipation becoming a major concern in ML architectures, computer architects have focused both on designing energy-efficient hardware platforms and on optimizing ML algorithms. To dramatically reduce power consumption and increase parallelism in neural network accelerators, disruptive technology such as silicon photonics has been proposed, which can improve performance-per-watt compared to electrical implementations. In this paper, we propose PIXEL, a photonic neural network accelerator that efficiently implements the fundamental operation in neural computation, namely the multiply-and-accumulate (MAC) functionality, using photonic components such as microring resonators (MRRs) and Mach-Zehnder interferometers (MZIs). We design two versions of PIXEL: a hybrid version that multiplies optically and accumulates electrically, and a fully optical version that multiplies and accumulates optically. We perform a detailed power, area, and timing analysis of the different versions of photonic and electronic accelerators for different convolutional neural networks (AlexNet, VGG16, and others). Our results indicate a significant improvement in the energy-delay product for both PIXEL designs over traditional electrical designs (48.4% for OE and 73.9% for OO) while minimizing latency, at the cost of increased area over electrical designs.
Citations: 35
FLOWER and FaME: A Low Overhead Bit-Level Fault-map and Fault-Tolerance Approach for Deeply Scaled Memories
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00037
Donald Kline, Jiangwei Zhang, R. Melhem, A. Jones
Maintaining appropriate yields in deeply scaled technologies requires tolerating increasingly high fault rates. These fault rates far exceed what traditional general approaches such as ECC can handle, particularly when faults accrue over time. Effective fault tolerance at such high fault rates requires detailed bit-level knowledge of the locations of faulty cells. We provide a solution to this problem in the form of a space-efficient, bit-level fault map called FLOWER. FLOWER utilizes Bloom filters to provide detailed fault characterization for a relatively small overhead. We demonstrate how FLOWER can enable improved fault tolerance at high fault rates by enhancing existing fault-tolerance proposals, yielding 10-100x improvements. Using in-memory processing, FLOWER can maintain a less than 2% performance overhead at 10E-4 fault rates with less than 2% loss of memory density, reporting bit-level faults with high accuracy. Using a tuned novel hashing technique called MinCI, FLOWER for memory achieves considerably lower false positives than disk-level hashing techniques at a fraction of the performance overhead. With a new technique to protect against errors during in-memory operations, PETAL bits, FLOWER can remain resilient against random errors while efficiently targeting predictable errors. Furthermore, we propose a new fault-tolerance scheme called FaME, which provides ultra-efficient bit-level sparing by using the FLOWER fault map to identify the locations of faults. FLOWER+FaME can achieve 14x longer PCM memory lifetime with half the area overhead of SECDED ECC.
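The core data structure — a Bloom filter answering "might this bit address be faulty?" with no false negatives — can be sketched as below. The filter size and hash construction are illustrative choices, not the paper's tuned MinCI hashing.

```python
import hashlib

class BloomFaultMap:
    """Sketch of a Bloom-filter-based bit-level fault map in the spirit
    of FLOWER. Parameters are illustrative, not the paper's design."""

    def __init__(self, m_bits=4096, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)

    def _positions(self, addr):
        # k independent hash positions derived from the bit address
        for i in range(self.k):
            d = hashlib.blake2b(f"{addr}:{i}".encode(), digest_size=8).digest()
            yield int.from_bytes(d, "big") % self.m

    def mark_faulty(self, addr):
        for p in self._positions(addr):
            self.bits[p] = 1

    def maybe_faulty(self, addr):
        # never a false negative; rare false positives buy compactness
        return all(self.bits[p] for p in self._positions(addr))

fmap = BloomFaultMap()
for addr in range(0, 500, 10):        # 50 known-bad bit addresses
    fmap.mark_faulty(addr)
```

Every marked address is always reported faulty, while the false-positive rate over unmarked addresses stays tiny at this load — the property that lets a small filter stand in for a full bit-level map.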
Citations: 11
A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00063
Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, Tony Nowatzki
Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This makes multi-threading impractical due to fine-grain synchronization, and vectorization ineffective due to the non-rectangular iteration domain. CPUs, DSPs, and GPUs perform an order of magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute such parallel code regions. To this end, we first develop a novel execution model, inductive dataflow, where inductive dependence patterns and memory access patterns (streams) are first-order primitives. Second, we develop a hybrid spatial architecture combining systolic and tagged dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6×-37×. Compared to state-of-the-art spatial architectures, REVEL is a mean 3× faster. Compared to a set of ASICs, REVEL is only 2× the power and half the area.
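A forward substitution makes the "inductive" pattern concrete: the inner trip count grows with the outer index, so the iteration domain is triangular rather than rectangular, and each row consumes all earlier results. This is a generic example of the kernel class the abstract describes, not code from REVEL's suite.

```python
def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L. Row i reads every earlier
    x[j], so the dependence distance and inner trip count change on each
    outer iteration -- the non-rectangular, fine-grain producer/consumer
    pattern that defeats plain vectorization and multi-threading."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        for j in range(i):            # trip count grows inductively with i
            acc -= L[i][j] * x[j]     # consumer of every earlier producer
        x[i] = acc / L[i][i]
    return x

L = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [4.0, 2.0, 2.0]]
x = forward_substitution(L, [2.0, 3.0, 14.0])
```

In the inductive-dataflow framing, the `j` loop's changing bounds and the stream of `x[j]` reads would be expressed as first-order primitives instead of rediscovered by the hardware each iteration.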
Citations: 41
Q-Zilla: A Scheduling Framework and Core Microarchitecture for Tail-Tolerant Microservices
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00026
Amirhossein Mirhosseini, Brendan L. West, G. Blake, T. Wenisch
Managing tail latency is a primary challenge in designing large-scale Internet services. Queuing is a major contributor to end-to-end tail latency, wherein nominal tasks are enqueued behind rare, long ones due to head-of-line (HoL) blocking. In this paper, we introduce Q-Zilla, a scheduling framework to tackle tail latency from a queuing perspective, and CoreZilla, a microarchitectural instantiation of our framework. On the algorithmic front, we first propose Server-Queue Decoupled Size-Interval Task Assignment (SQD-SITA), an efficient scheduling algorithm to minimize tail latency for high-disparity service distributions. SQD-SITA is inspired by an earlier algorithm, SITA, which explicitly seeks to address HoL blocking by providing an "express lane" for short tasks, protecting them from queuing behind rare, long ones. But SITA requires prior knowledge of task lengths to steer them into their corresponding lane, which is impractical. Furthermore, SITA may underperform an M/G/k system when some lanes become underutilized. In contrast, SQD-SITA uses incremental preemption to avoid the need for a priori task-size information, and dynamically reallocates servers to lanes to increase server utilization with no performance penalty. We then introduce Interruptible SQD-SITA, which further improves tail latency at the cost of additional preemptions. Finally, we describe and evaluate CoreZilla, wherein a multi-threaded core efficiently implements ISQD-SITA in a software-transparent manner at minimal cost. Our evaluation demonstrates that CoreZilla improves tail latency over a conventional SMT core with 2, 4, and 8 contexts by 2.25×, 3.23×, and 4.88× on average, respectively.
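The express-lane idea behind classic SITA — the starting point the abstract describes — is easy to demonstrate with a toy deterministic trace. Note this toy assumes task sizes are known in advance, exactly the impracticality that SQD-SITA removes via incremental preemption.

```python
def fifo_finish_times(tasks):
    """Completion time of each task when all share one FIFO queue."""
    clock, out = 0.0, []
    for size in tasks:
        clock += size
        out.append(clock)
    return out

def sita_finish_times(tasks, cutoff):
    """Classic SITA sketch: tasks below the cutoff go to an express lane,
    so short work never queues behind rare long tasks (HoL blocking).
    Illustrative only -- not the paper's SQD-SITA variant."""
    lane_clock = [0.0, 0.0]                 # [short lane, long lane]
    out = []
    for size in tasks:
        lane = 0 if size < cutoff else 1
        lane_clock[lane] += size
        out.append(lane_clock[lane])
    return out

tasks = [100.0, 1.0, 1.0, 1.0]              # one rare long task ahead of short ones
fifo = fifo_finish_times(tasks)             # short tasks suffer HoL blocking
sita = sita_finish_times(tasks, cutoff=10.0)
```

In the FIFO, the three short tasks finish at 101, 102, and 103 time units because they sit behind the long one; with the express lane they finish at 1, 2, and 3 — the tail-latency collapse that motivates the whole framework.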
Citations: 16
Mitigating Voltage Drop in Resistive Memories by Dynamic RESET Voltage Regulation and Partition RESET
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00031
Farzaneh Zokaee, Lei Jiang
The emerging resistive random access memory (ReRAM) technology has been deemed one of the most promising alternatives to DRAM in main memories, due to its better scalability, zero cell leakage, and short read latency. The cross-point (CP) array enables ReRAM to attain the theoretical minimum 4F^2 cell size by placing a cell at the cross-point of a word-line and a bit-line. However, ReRAM CP arrays suffer from large sneak currents resulting in significant voltage drop that greatly prolongs the array RESET latency. Although prior works reduce the voltage drop in CP arrays, they either substantially increase the array peripheral overhead or cannot work well with wear-leveling schemes. In this paper, we propose two array microarchitecture-level techniques, dynamic RESET voltage regulation (DRVR) and partition RESET (PR), to mitigate voltage drop on both bit-lines and word-lines in ReRAM CP arrays. DRVR dynamically provides higher RESET voltage to the cells far from the write driver, which encounter larger voltage drop on a bit-line, so that all cells on a bit-line share approximately the same latency during RESETs. PR decides how many and which cells to reset online, partitioning the CP array into multiple equivalent circuits with smaller word-line resistance and voltage drop. Because DRVR and PR greatly reduce the array RESET latency, the ReRAM-based main memory lifetime under worst-case non-stop write traffic significantly decreases. To increase CP array endurance, we further upgrade DRVR by providing lower RESET voltage to the cells suffering less voltage drop on a word-line. Our experimental results show that, compared to the combination of prior voltage-drop reduction techniques, our DRVR and PR improve system performance by 11.7% and decrease energy consumption by 46% on average, while still maintaining a >10-year main memory system lifetime.
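DRVR's compensation can be shown with a linearized IR-drop model: the farther a cell sits from the write driver, the more wire resistance the RESET current crosses, so the driver voltage is raised by the expected drop. The constants are illustrative assumptions, and a real array is nonlinear (the current itself depends on the delivered voltage), so treat this strictly as a sketch.

```python
I_RESET = 50e-6   # assumed RESET current (A) -- illustrative constant
R_SEG = 20.0      # assumed wire resistance per cell segment (ohms)

def delivered_voltage(v_drive, cell_index):
    """Linearized IR-drop model: the RESET current drops voltage across
    each wire segment between the write driver and the target cell."""
    return v_drive - I_RESET * R_SEG * cell_index

def drvr_drive_voltage(v_target, cell_index):
    """DRVR's idea in one line: pre-compensate the driver voltage by the
    expected drop so near and far cells see the same effective RESET
    voltage, and hence roughly the same RESET latency."""
    return v_target + I_RESET * R_SEG * cell_index

far_uncompensated = delivered_voltage(2.0, 100)                       # sags to 1.9 V
far_compensated = delivered_voltage(drvr_drive_voltage(2.0, 100), 100)  # back to 2.0 V
```

The endurance-oriented refinement in the abstract runs the same logic in reverse: cells with little word-line drop get a correspondingly lower drive voltage.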
Citations: 4
Missing the Forest for the Trees: End-to-End AI Application Performance in Edge Data Centers
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00049
Daniel Richins, Dharmisha Doshi, Matthew Blackmore, A. Nair, Neha Pathapati, Ankit Patel, Brainard Daguman, Daniel Dobrijalowski, R. Illikkal, Kevin Long, David Zimmerman, V. Reddi
Artificial intelligence and machine learning are experiencing widespread adoption in industry, academia, and even the public consciousness. This has been driven by rapid advances in the applications and accuracy of AI through increasingly complex algorithms and models; this, in turn, has spurred research into specialized hardware AI accelerators. The rapid pace of the advances makes it easy to miss the forest for the trees: accelerators are often developed and evaluated in a vacuum, without considering the full application environment in which they must eventually operate. In this paper, we deploy and characterize Face Recognition, an AI-centric edge video-analytics application built using open-source and widely adopted infrastructure and ML tools. We evaluate its holistic, end-to-end behavior in a production-size edge data center and reveal the "AI tax" for all the processing that is involved. Even though the application is built around state-of-the-art AI and ML algorithms, it relies heavily on pre- and post-processing code which must be executed on a general-purpose CPU. As AI-centric applications start to reap the acceleration promised by so many accelerators, we find they impose stresses on the underlying software infrastructure and the data center's capabilities: storage and network bandwidth become major bottlenecks with increasing AI acceleration. By not having to serve a wide variety of applications, we show that a purpose-built edge data center can be designed to accommodate the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure. We also discuss how our conclusions generalize beyond Face Recognition, as many AI-centric applications at the edge rely upon the same underlying software and hardware infrastructure.
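The "AI tax" argument is Amdahl's law in disguise: only the AI portion of the pipeline is accelerated, so CPU-bound pre- and post-processing caps the end-to-end gain. A one-line model (illustrative numbers, not the paper's measurements):

```python
def end_to_end_speedup(ai_fraction, ai_speedup):
    """Amdahl's-law view of the 'AI tax': with only `ai_fraction` of the
    pipeline accelerated by `ai_speedup`, the unaccelerated CPU work
    bounds the whole-application speedup."""
    return 1.0 / ((1.0 - ai_fraction) + ai_fraction / ai_speedup)

# Hypothetical pipeline where 80% of time is AI inference: a 10x
# accelerator yields ~3.6x end to end, and even an infinitely fast
# accelerator can never exceed 5x.
realistic = end_to_end_speedup(0.8, 10.0)
ceiling = end_to_end_speedup(0.8, 1e9)
```

This is why the abstract stresses evaluating accelerators inside the full application environment: the tree (the AI kernel) speeds up, but the forest (storage, network, pre/post-processing) sets the limit.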
Citations: 19
ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2020-02-01 DOI: 10.1109/HPCA47549.2020.00029
Bahar Asgari, Ramyad Hadidi, T. Krishna, Hyesoon Kim, S. Yalamanchili
Sparse problems, which dominate a wide range of applications, fail to effectively benefit from the high memory bandwidth and concurrent computations of modern high-performance computer systems. Therefore, hardware accelerators have been proposed to capture a high degree of parallelism in sparse problems. However, the unexplored challenge for sparse problems is the limited opportunity for parallelism because of data dependencies, a common computation pattern in scientific sparse problems. Our key insight is to extract parallelism by mathematically transforming the computations into equivalent forms. The transformation breaks down the sparse kernels into a majority of independent parts and a minority of data-dependent ones, and reorders these parts to gain performance. To implement this key insight, we propose a lightweight reconfigurable sparse-computation accelerator (Alrescha). To efficiently run the data-dependent and parallel parts, and to enable fast switching between them, Alrescha makes two contributions. First, it implements a compute engine with a fixed compute unit for the parallel parts and a lightweight reconfigurable engine for executing the data-dependent parts. Second, Alrescha benefits from a locally-dense storage format, with the right order of non-zero values to yield the order of computations dictated by the transformation. The combination of the lightweight reconfigurable hardware and the storage format enables uninterrupted streaming from memory. Our simulation results show that, compared to GPU, Alrescha achieves an average speedup of 15.6x for scientific sparse problems and 8x for graph algorithms. Moreover, compared to GPU, Alrescha consumes 14x less energy.
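The split into "a majority of independent parts and a minority of data-dependent ones" can be illustrated with level scheduling, a standard transformation for sparse kernels with dependencies (shown here for a sparse triangular solve; this is the general technique, not Alrescha's exact reordering):

```python
def dependency_levels(deps):
    """Level-scheduling sketch: deps[i] lists the earlier rows that row i
    of a sparse triangular solve reads. Rows assigned to the same level
    share no dependences, so each level is a fully parallel batch and
    only the sequence of levels is data-dependent."""
    level = [0] * len(deps)
    for i, js in enumerate(deps):
        level[i] = 1 + max((level[j] for j in js), default=-1)
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)   # one parallel batch per level
    return [groups[lv] for lv in sorted(groups)]

# rows 0 and 1 are independent; row 2 reads row 0; row 3 reads rows 1 and 2
levels = dependency_levels([[], [], [0], [1, 2]])
```

In an Alrescha-like design, the wide batches would feed the fixed parallel compute unit while the short dependent chain between levels runs on the reconfigurable engine.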
Citations: 28