{"title":"Achieving Forward Progress Guarantee in Small Hardware Transactions","authors":"Mahita Nagabhiru;Gregory T. Byrd","doi":"10.1109/LCA.2024.3370992","DOIUrl":"10.1109/LCA.2024.3370992","url":null,"abstract":"Hardware-transactional-memory (HTM) manages to pique interest from academia and industry alike because of its potential to ease concurrent-programming without compromising on performance. It offers a simple “all-or-nothing” idea to the programmer, making a piece of code appear atomic in hardware. Despite this and many elegant HTM implementations in research, only best-effort HTM is available commercially. Best-effort HTM lacks forward progress guarantee making it harder for the programmer to create a concurrent scalable fallback path. This has made HTM's adaptability limited. With a scope to support a myriad of applications, HTMs do a trade off between design and verification complexity vs forward progress guarantee. In this letter, we argue that limiting the scope of applications helps HTM attain guaranteed forward progress. We support lock-free programs by using HTM as multi-word-atomics and demonstrate strategic design choices to achieve lock-freedom completely in hardware. We use lfbench, a lock-free micro-benchmark-suite, and Arm's best-effort HTM (ARM_TME) on the gem5 simulator, as our base. We demonstrate the performance tradeoffs between design choices of a deferral-based, NACK-based, and NACK-with-backoff approaches. We show that NACK-with-backoff performs better than the others without compromising scalability for both read- and write-intensive applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"53-56"},"PeriodicalIF":2.3,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140007794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FullPack: Full Vector Utilization for Sub-Byte Quantized Matrix-Vector Multiplication on General Purpose CPUs","authors":"Hossein Katebi;Navidreza Asadi;Maziar Goudarzi","doi":"10.1109/LCA.2024.3370402","DOIUrl":"10.1109/LCA.2024.3370402","url":null,"abstract":"Sub-byte quantization on popular vector ISAs suffers from heavy waste of vector as well as memory bandwidth. The latest methods pack a number of quantized data in one vector, but have to pad them with empty bits to avoid overflow to neighbours. We remove even these empty bits and provide full utilization of the vector and memory bandwidth by our data-layout/compute co-design scheme. We implemented FullPack on TFLite for Vector-Matrix multiplication and showed up to \u0000<inline-formula><tex-math>$6.7times$</tex-math></inline-formula>\u0000 speedup, \u0000<inline-formula><tex-math>$2.75times$</tex-math></inline-formula>\u0000 on average on single layers, which translated to \u0000<inline-formula><tex-math>$1.56-2.11times$</tex-math></inline-formula>\u0000 end-to-end speedup on DeepSpeech.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"142-145"},"PeriodicalIF":1.4,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140007796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JANM-IK: Jacobian Argumented Nelder-Mead Algorithm for Inverse Kinematics and its Hardware Acceleration","authors":"Yuxin Yang;Xiaoming Chen;Yinhe Han","doi":"10.1109/LCA.2024.3369940","DOIUrl":"10.1109/LCA.2024.3369940","url":null,"abstract":"Inverse kinematics is one of the core calculations in robotic applications and has strong performance requirements. Previous hardware acceleration work paid little attention to joint constraints, which can lead to computational failures. We propose a new inverse kinematics algorithm JANM-IK. It uses a hardware-friendly design, optimizes the Jacobian-based method and Nelder-Mead method, realizes the processing of joint constraints, and has a high convergence speed. We further designed its acceleration architecture to achieve high-performance computing through sufficient parallelism and hardware optimization. Finally, after experimental verification, JANM-IK can achieve a very high success rate and obtain certain performance improvements.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"45-48"},"PeriodicalIF":2.3,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139979255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Energy-Efficiency of Capsule Networks on Modern GPUs","authors":"Mohammad Hafezan;Ehsan Atoofian","doi":"10.1109/LCA.2024.3365149","DOIUrl":"10.1109/LCA.2024.3365149","url":null,"abstract":"Convolutional neural networks (CNNs) have become the compelling solution in machine learning applications as they surpass human-level accuracy in a certain set of tasks. Despite the success of CNNs, they classify images based on the identification of specific features, ignoring the spatial relationships between different features due to the pooling layer. The capsule network (CapsNet) architecture proposed by Google Brain's team is an attempt to address this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship with other features. However, the success of the CapsNet comes at the cost of underutilization of resources when it is run on a modern GPU equipped with tensor cores (TCs). Due to the structure of capsules in the CapsNet, quite often, functional units in a TC are underutilized which prolong the execution of capsule layers and increase energy consumption. In this work, we propose an architecture to eliminate ineffectual operations and improve energy-efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy-efficiency by 15% while maintaining the accuracy of CapsNets.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"49-52"},"PeriodicalIF":2.3,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models","authors":"Minsik Cho;Keivan A. Vahid;Qichen Fu;Saurabh Adya;Carlo C. Del Mundo;Mohammad Rastegari;Devang Naik;Peter Zatloukal","doi":"10.1109/LCA.2024.3363492","DOIUrl":"10.1109/LCA.2024.3363492","url":null,"abstract":"Since Large Language Models or LLMs have demonstrated high-quality performance on many complex language tasks, there is a great interest in bringing these LLMs to mobile devices for faster responses and better privacy protection. However, the size of LLMs (i.e., billions of parameters) requires highly effective compression to fit into storage-limited devices. Among many compression techniques, weight-clustering, a form of non-linear quantization, is one of the leading candidates for LLM compression, and supported by modern smartphones. Yet, its training overhead is prohibitively significant for LLM fine-tuning. Especially, Differentiable KMeans Clustering, or DKM, has shown the state-of-the-art trade-off between compression ratio and accuracy regression, but its large memory complexity makes it nearly impossible to apply to train-time LLM compression. In this letter, we propose a memory-efficient DKM implementation, eDKM powered by novel techniques to reduce the memory footprint of DKM by orders of magnitudes. For a given tensor to be saved on CPU for the backward pass of DKM, we compressed the tensor by applying uniquification and sharding after checking if there is no duplicated tensor previously copied to CPU. Our experimental results demonstrate that eDKM can fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB (3 b/weight) with the Alpaca dataset by reducing the train-time memory footprint of a decoder layer by 130×, while delivering good accuracy on broader LLM benchmarks (i.e., 77.7% for PIQA, 66.1% for Winograde, and so on).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"37-40"},"PeriodicalIF":2.3,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead","authors":"Lieven Eeckhout","doi":"10.1109/LCA.2024.3361925","DOIUrl":"10.1109/LCA.2024.3361925","url":null,"abstract":"How to accurately summarize average performance is challenging. While geometric mean speedup is prevalently used, it is meaningless. Instead, this paper argues for harmonic mean speedup which accurately summarizes how much faster a workload executes on a target system relative to a baseline. We propose the equal-work and equal-time harmonic mean speedup metrics to explicitly expose the different assumptions they make, and we further suggest that equal-work speedup is most relevant to computer architecture research. The paper demonstrates that which average speedup is used matters in practice as inappropriate averages may lead to incorrect conclusions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"78-82"},"PeriodicalIF":2.3,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Baobab Merkle Tree for Efficient Secure Memory","authors":"Samuel Thomas;Kidus Workneh;Ange-Thierry Ishimwe;Zack McKevitt;Phaedra Curlin;R. Iris Bahar;Joseph Izraelevitz;Tamara Lehman","doi":"10.1109/LCA.2024.3360709","DOIUrl":"10.1109/LCA.2024.3360709","url":null,"abstract":"Secure memory is a natural solution to hardware vulnerabilities in memory, but it faces fundamental challenges of performance and memory overheads. While significant work has gone into optimizing the protocol for performance, far less work has gone into optimizing its memory overhead. In this work, we propose the \u0000<italic>Baobab Merkle Tree</i>\u0000, in which counters are memoized in an on-chip table. The Baobab Merkle Tree reduces spatial overhead of a Bonsai Merkle Tree by 2-4X without incurring performance overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"33-36"},"PeriodicalIF":2.3,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Primate: A Framework to Automatically Generate Soft Processors for Network Applications","authors":"Rui Ma;Jia-Ching Hsu;Ali Mansoorshahi;Joseph Garvey;Michael Kinsner;Deshanand Singh;Derek Chiou","doi":"10.1109/LCA.2024.3358839","DOIUrl":"10.1109/LCA.2024.3358839","url":null,"abstract":"Overlay processors on FPGAs enable i) software programmability through sequential code calling library functions, ii) high performance by converting the library calls to invocations of corresponding accelerators, and iii) faster deployment than reprogramming the FPGA. Traditionally, overlays have been hand-written in RTL and programmed through handwritten assembly. We present the Primate framework, which automatically generates overlays from applications written in annotated C++. We evaluated Primate on Whippersnapper (Dang et al. 2017) P4 benchmarks. Primate Overlay latencies are 0.06x - 0.15x compared to PISCES (Shahbaz et al. 2016), a high-performance CPU solution, and 0.25x - 2.3x compared to solutions generated by P4FPGA (Wang et al. 2017), a P4 HLS compiler on FPGA.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"57-60"},"PeriodicalIF":2.3,"publicationDate":"2024-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Memory Layout for Pre-Alignment Filtering of Long DNA Reads Using Racetrack Memory","authors":"Asif Ali Khan;Fazal Hameed;Taha Shahroodi;Alex K. Jones;Jeronimo Castrillon","doi":"10.1109/LCA.2024.3350701","DOIUrl":"10.1109/LCA.2024.3350701","url":null,"abstract":"DNA sequence alignment is a fundamental and computationally expensive operation in bioinformatics. Researchers have developed \u0000<i>pre-alignment</i>\u0000 filters that effectively reduce the amount of data consumed by the alignment process by discarding locations that result in a poor match. However, the filtering operation itself is memory-intensive for which the conventional Von-Neumann architectures perform poorly. Therefore, recent designs advocate compute near memory (CNM) accelerators based on stacked DRAM and more exotic memory technologies such as \u0000<i>racetrack memories</i>\u0000 (RTM). However, these designs only support small DNA reads of circa 100 nucleotides, referred to as \u0000<i>short reads</i>\u0000. This letter proposes a CNM system for handling both long and short reads. It introduces a novel data-placement solution that significantly increases parallelism and reduces overhead. Evaluation results show substantial reductions in execution time (\u0000<inline-formula><tex-math>$1.32times$</tex-math></inline-formula>\u0000) and energy consumption (50%), compared to the state-of-the-art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"129-132"},"PeriodicalIF":2.3,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity","authors":"Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos","doi":"10.1109/LCA.2024.3355178","DOIUrl":"10.1109/LCA.2024.3355178","url":null,"abstract":"Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000:128, or \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000:256, for small values of \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and \u0000<inline-formula><tex-math>$N$</tex-math></inline-formula>\u0000 read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"17-20"},"PeriodicalIF":2.3,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139528673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}