{"title":"Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code","authors":"Jiang Wu, Zhuo Zhang, Deheng Yang, Jianjun Xu, Jiayu He, Xiaoguang Mao","doi":"10.1145/3660526","DOIUrl":"https://doi.org/10.1145/3660526","url":null,"abstract":"<p>Verification of hardware design code is crucial for the quality assurance of hardware products. Being an indispensable part of verification, localizing bugs in the hardware design code is significant for hardware development but is often regarded as a notoriously difficult and time-consuming task. Thus, automated bug localization techniques that could assist manual debugging have attracted much attention in the hardware community. However, existing approaches are hampered by the challenge of achieving both demanding bug localization accuracy and facile automation in a single method. Simulation-based methods are fully automated but have limited localization accuracy, slice-based techniques can only give an approximate range of the presence of bugs, and spectrum-based techniques can also only yield a reference value for the likelihood that a statement is buggy. Furthermore, formula-based bug localization techniques suffer from the complexity of combinatorial explosion for automated application in industrial large-scale hardware designs. In this work, we propose Kummel, a <underline>K</underline>nowledge-a<underline>u</underline>g<underline>m</underline>ented <underline>m</underline>utation-bas<underline>e</underline>d bug loca<underline>l</underline>ization for hardware design code to address these limitations. Kummel achieves the unity of precise bug localization and full automation by utilizing the knowledge augmentation through mutation analysis. To evaluate the effectiveness of Kummel, we conduct large-scale experiments on 76 versions of 17 hardware projects by seven state-of-the-art bug localization techniques. The experimental results clearly show that Kummel is statistically more effective than baselines, e.g., our approach can improve the seven original methods by 64.48% on average under the RImp metric. It brings fresh insights of hardware bug localization to the community.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"115 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign","authors":"Ke Wu, Dezun Dong, Weixia Xu","doi":"10.1145/3660525","DOIUrl":"https://doi.org/10.1145/3660525","url":null,"abstract":"<p>RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high throughput and low latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software. This limitation has hindered their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Card (NIC) can drive progress in congestion control. Some simple congestion control protocols have been offloaded to RDMA NIC to enable faster detection and processing of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge in integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We have observed that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and network layer. In this paper, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes the scheduling of congestion control during the scheduling process of the RDMA protocol. COER facilitates the streamlined development of offload strategies for congestion control techniques, specifically proactive congestion control, on RDMA NIC. We use COER to design offloading schemes for eleven congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load MPI programs. The evaluation results demonstrate that the architecture of COER does not compromise the original characteristics of the congestion control protocols. Compared to a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"20 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140636735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads","authors":"Qunyou Liu, Darong Huang, Luis Costero, Marina Zapater, David Atienza","doi":"10.1145/3659207","DOIUrl":"https://doi.org/10.1145/3659207","url":null,"abstract":"<p>The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of <i>Accelerated Processing Units</i> (APUs), processors that incorporate both a <i>central processing unit</i> (CPU) and an <i>integrated graphics processing unit</i> (iGPU). </p><p>However, the performance of both APU and CPU systems can be significantly hampered by address translation overhead, leading to a decline in overall performance, especially for cache-resident workloads. To address this issue, we propose the introduction of a new <i>intermediate address space</i> (IAS) in both APU and CPU systems. IAS serves as a bridge between <i>virtual address</i> (VA) spaces and <i>physical address</i> (PA) spaces, optimizing the address translation process. In the case of APU systems, our research indicates that the iGPU suffers from significant <i>translation look-aside buffer</i> (TLB) misses in certain workload situations. Using an IAS, we can divide the initial address translation into front- and back-end phases, effectively shifting the bottleneck in address translation from the cache side to the memory controller side, a technique that proves to be effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in the CPU system can boost performance by up to 40% compared to conventional CPU systems. Furthermore, we evaluate the effectiveness of APU systems, comparing the performance of IAS-based systems with traditional systems, showing up to a 185% improvement in APU system performance with our proposed IAS implementation. </p><p>Furthermore, our analysis indicates that over 90% of TLB misses can be filtered by the cache, and employing a larger cache within the system could potentially result in even greater improvements. The proposed IAS offers a promising and practical solution to enhance the performance of both APU and CPU systems, contributing to state-of-the-art research in the field of computer architecture.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140625640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReHarvest: an ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators","authors":"Jiahong Xu, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, Xiaokang Yang, Huize Li, Cong Liu, Fubing Mao, Yu Zhang","doi":"10.1145/3659208","DOIUrl":"https://doi.org/10.1145/3659208","url":null,"abstract":"<p>ReRAM-based <i>Processing-In-Memory</i> (PIM) architectures have been increasingly explored to accelerate various <i>Deep Neural Network</i> (DNN) applications because they can achieve extremely high performance and energy-efficiency for in-situ analog <i>Matrix-Vector Multiplication</i> (MVM) operations. However, since ReRAM crossbar arrays’ peripheral circuits–<i>analog-to-digital converters</i> (ADCs) often feature high latency and low area efficiency, AD conversion has become a performance bottleneck of in-situ analog MVMs. Moreover, since each crossbar array is tightly coupled with very limited ADCs in current ReRAM-based PIM architectures, the scarce ADC resource is often underutilized. </p><p>In this paper, we propose ReHarvest, an ADC-crossbar decoupled architecture to improve the utilization of ADC resource. Particularly, we design a many-to-many mapping structure between crossbars and ADCs to share all ADCs in a tile as a resource pool, and thus one crossbar array can harvest much more ADCs to parallelize the AD conversion for each MVM operation. Moreover, we propose a <i>multi-tile matrix mapping</i> (MTMM) scheme to further improve the ADC utilization across multiple tiles by enhancing data parallelism. To support fine-grained data dispatching for the MTMM, we also design a bus-based interconnection network to multicast input vectors among multiple tiles, and thus eliminate data redundancy and potential network congestion during multicasting. Extensive experimental results show that ReHarvest can improve the ADC utilization by 3.2 ×, and achieve 3.5 × performance speedup while reducing the ReRAM resource consumption by 3.1 × on average compared with the state-of-the-art PIM architecture–FORMS.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 5 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140612507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU","authors":"Ziheng Wang, Xiaoshe Dong, Yan Kang, Heng Chen, Qiang Wang","doi":"10.1145/3659209","DOIUrl":"https://doi.org/10.1145/3659209","url":null,"abstract":"<p>The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) now. Existing HBSs are designed based on serial Merkle tree traversal, which is not conducive to taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which is tested by implementing LMS on the GPU. This is the first work accelerating LMS on the GPU, which performs well even with over 10,000 cores. Considering different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants for PMTT. The design of PMTT for algorithmic parallelism mainly considers the execution efficiency of a single task, while that for data parallelism starts with the full utilisation of GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48 × faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8192, the parallel efficiency is 78.39%. In comparison, our LMS implementation outperforms most existing LMS and XMSS implementations.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"67 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140588978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage","authors":"Chen Ding, Jian Zhou, Kai Lu, Sicen Li, Yiqin Xiong, Jiguang Wan, Ling Zhan","doi":"10.1145/3656584","DOIUrl":"https://doi.org/10.1145/3656584","url":null,"abstract":"<p>LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. The compaction brings severe CPU and network overhead on high-speed disaggregated storage. This paper further reveals that data-intensive compression in compaction consumes a significant portion of CPU power. Moreover, the multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on the above observations, we propose fine-grained dynamical compaction offloading by leveraging the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customized a file system to enable efficient data access for DPU. We then leverage the Arm cores on the DPU to meet the burst CPU and network requirements to reduce resource contention and data movement. We further employ dedicated hardware-based accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA’s latest Bluefield-2 DPU on a real system. The evaluation shows that the DPU is an effective solution to solve the CPU bottleneck and reduce data traffic of compaction. The results show that compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced compared to the fine-tuned CPU-only baseline.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"16 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments","authors":"Zhuohao Wang, Lei Liu, Limin Xiao","doi":"10.1145/3653302","DOIUrl":"https://doi.org/10.1145/3653302","url":null,"abstract":"<p>This paper proposes iSwap, a new memory page swap mechanism that reduces the ineffective I/O swap operations and improves the QoS for applications with a high priority in the cloud environments. iSwap works in the OS kernel. iSwap accurately learns the reuse patterns for memory pages and makes the swap decisions accordingly to avoid ineffective operations. In the cases where memory pressure is high, iSwap compresses pages that belong to the latency-critical (LC) applications (or high-priority applications) and keeps them in main memory, avoiding I/O operations for these LC applications to ensure QoS; and iSwap evicts low-priority applications’ pages out of main memory. iSwap has a low overhead and works well for cloud applications with large memory footprints. We evaluate iSwap on Intel x86 and ARM platforms. The experimental results show that iSwap can significantly reduce ineffective swap operations (8.0% - 19.2%) and improve the QoS for LC applications (36.8% - 91.3%) in cases where memory pressure is high, compared with the latest LRU-based approach widely used in modern OSes.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"30 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors","authors":"Ching-Jui Lee, Tsung Tai Yeh","doi":"10.1145/3653363","DOIUrl":"https://doi.org/10.1145/3653363","url":null,"abstract":"<p>Systolic array architecture has significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that can perform multiply-accumulate (MAC). Traditionally, the systolic array can execute a certain amount of tensor data that matches the size of the systolic array simultaneously at each cycle. However, hyper-parameters of DNN models differ across each layer and result in various tensor sizes in each layer. Mapping these irregular tensors to the systolic array while fully utilizing the entire PEs in a systolic array is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow. However, such a dataflow isn’t optimal for every DNN model. </p><p>This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors on the spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data path controller enables the execution of the input, weight, and output-stationary dataflow on PEs. ReSA also decomposes the coarse-grain systolic array into multiple small ones to reduce the fragmentation issue on the tensor mapping. Each small systolic sub-array unit relies on our data arbiter to dispatch tensors to each other through the simple interconnected network. Furthermore, ReSA reorders the memory access to overlap the memory load and execution stages to hide the memory latency when tackling tiny tensors. Finally, ReSA splits tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. Then, ReSA encodes the predefined dataflow in our proposed instruction to notify the systolic array to switch the dataflow correctly. As a result, our optimization on the systolic array architecture achieves a geometric mean speedup of 1.87X over the weight-stationary systolic array architecture across 9 different DNN models.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"19 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Core Data Sharing for Energy-Efficient GPUs","authors":"Hajar Falahati, Mohammad Sadrosadati, Qiumin Xu, Juan Gómez-Luna, Banafsheh Saber Latibari, Hyeran Jeon, Shaahin Hesaabi, Hamid Sarbazi-Azad, Onur Mutlu, Murali Annavaram, Masoud Pedram","doi":"10.1145/3653019","DOIUrl":"https://doi.org/10.1145/3653019","url":null,"abstract":"<p>Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit the locality in data accesses. L1D misses are costly for GPUs due to two reasons. First, L1D misses consume a lot of energy as they need to access the L2 cache (L2) via an on-chip network and the off-chip DRAM in case of L2 misses. Second, L1D misses impose performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from the memory. Unfortunately, as the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may be currently available in the L1D of another SM. Our goal is to service L1D read misses via other SMs, as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data sharing mechanism, called <i>Cross-Core Data Sharing (CCDS)</i>. <i>CCDS</i> employs a predictor to estimate whether or not the required cache block exists in another SM. If the block is predicted to exist in another SM’s L1D, <i>CCDS</i> fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that <i>CCDS</i> improves average energy and performance by 1.30 × and 1.20 ×, respectively, compared to the baseline GPU. Compared to the state-of-the-art data-sharing mechanism, <i>CCDS</i> improves average energy and performance by 1.37 × and 1.11 ×, respectively.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"23 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication","authors":"Soojin Hwang, Daehyeon Baek, Jongse Park, Jaehyuk Huh","doi":"10.1145/3653020","DOIUrl":"https://doi.org/10.1145/3653020","url":null,"abstract":"<p>The multiplication of sparse matrix and vector (SpMV) is one of the most widely used kernels in high-performance computing as well as machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. There have been two widely used algorithms and data representations. Two algorithms, scalar multiplication and dot product, can be combined with two sparse data representations, compressed sparse and bitmap formats for the matrix and vector. Although the prior accelerators adopted one of the possible designs, it is yet to be investigated which design is the best one across different hardware resources and workload characteristics. This paper first investigates the impact of design choices with respect to the algorithm and data representation. Our evaluation shows that no single design always outperforms the others across different workloads, but the two best designs (i.e. compressed sparse format and bitmap format with dot product) have complementary performance with trade-offs incurred by the matrix characteristics. Based on the analysis, this study proposes Cerberus, a triple-mode accelerator supporting two sparse operation modes in addition to the base dense mode. To allow such multi-mode operation, it proposes a prediction model based on matrix characteristics under a given hardware configuration, which statically selects the best mode for a given sparse matrix with its dimension and density information. Our experimental results show that Cerberus provides 12.1 × performance improvements from a dense-only accelerator, and 1.5 × improvements from a fixed best SpMV design.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"28 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140151823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}