{"title":"Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI","authors":"Mrinmay Sasmal;Tresa Joseph;Bindiya T. S.","doi":"10.1109/LCA.2024.3379002","DOIUrl":"10.1109/LCA.2024.3379002","url":null,"abstract":"This letter introduces an innovative approximate multiplier (AM) architecture that leverages stochastically generated bit streams through the Linear Feedback Shift Register (LFSR). The AM is applied to matrix-vector multiplication (MVM) in Neural Networks (NNs). The hardware implementations in 90 nm CMOS technology demonstrate superior power and area efficiency compared to state-of-the-art designs. Additionally, the study explores applying stochastic computing to LSTM NNs, showcasing improved energy efficiency and speed.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"91-94"},"PeriodicalIF":2.3,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140169772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hashing ATD Tags for Low-Overhead Safe Contention Monitoring","authors":"Pablo Andreu;Pedro Lopez;Carles Hernandez","doi":"10.1109/LCA.2024.3401570","DOIUrl":"10.1109/LCA.2024.3401570","url":null,"abstract":"Increasing the performance of safety-critical systems via introducing multicore processors is becoming the norm. However, when multiple cores access a shared cache, inter-core evictions become a relevant source of interference that must be appropriately controlled. To solve this issue, one can statically partition caches and remove the interference. Unfortunately, this comes at the expense of less flexibility and, in some cases, worse performance. In this context, enabling more flexible cache allocation policies requires additional monitoring support. This paper proposes HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG enables a low-overhead implementation of an Auxiliary Tag Directory to determine inter-core evictions. Our results show that no inter-task interference underprediction is possible with HashTAG while providing a 44% reduction in ATD area with only 1.14% median overprediction.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"166-169"},"PeriodicalIF":1.4,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10530895","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case for In-Memory Random Scatter-Gather for Fast Graph Processing","authors":"Changmin Shin;Taehee Kwon;Jaeyong Song;Jae Hyung Ju;Frank Liu;Yeonkyu Choi;Jinho Lee","doi":"10.1109/LCA.2024.3376680","DOIUrl":"10.1109/LCA.2024.3376680","url":null,"abstract":"Because of the widely recognized memory wall issue, modern DRAMs are increasingly being assigned innovative functionalities beyond the basic read and write operations. Often referred to as “function-in-memory”, these techniques are crafted to leverage the abundant internal bandwidth available within the DRAM. However, these techniques face several challenges, including requiring large areas for arithmetic units and the necessity of splitting a single word into multiple pieces. These challenges severely limit the practical application of these function-in-memory techniques. In this paper, we present Piccolo, an efficient design of random scatter-gather memory. Our method achieves significant improvements with minimal overhead. By demonstrating our technique on a graph processing accelerator, we show that Piccolo and the proposed accelerator achieve 1.2-3.1× speedup compared to the prior art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"73-77"},"PeriodicalIF":2.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140124987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information","authors":"Saurav Maji;Kyungmi Lee;Anantha P. Chandrakasan","doi":"10.1109/LCA.2024.3397730","DOIUrl":"10.1109/LCA.2024.3397730","url":null,"abstract":"This letter explores security vulnerabilities in sparsity-aware optimizations for Neural Network (NN) platforms, specifically focusing on timing side-channel attacks introduced by optimizations such as skipping sparse multiplications. We propose a classification prediction attack that utilizes this timing side-channel information to mimic the NN's prediction outcomes. Our techniques were demonstrated for CIFAR-10, MNIST, and biomedical classification tasks using diverse dataflows and processing loads in timing models. The demonstrated results could predict the original classification decision with high accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"133-136"},"PeriodicalIF":2.3,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140925787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management","authors":"Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry","doi":"10.1109/LCA.2024.3373760","DOIUrl":"10.1109/LCA.2024.3373760","url":null,"abstract":"In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs which can cause performance and correctness challenges. Using run-time monitoring tools is one approach to mitigate these challenges. However, these tools maintain metadata for every byte of application data they monitor, which precipitates performance overheads from additional metadata accesses. We propose Address Scaling, a new hardware framework that performs fine-grained metadata management to reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied granularities. Our key insight is to maintain the data and its corresponding metadata within the same cache line, to preserve locality. Address Scaling improves the performance of Memcheck, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns respectively, compared to the state-of-the-art systems that store the metadata in a memory region that is separate from the data.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"69-72"},"PeriodicalIF":2.3,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140057254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Direct Memory Operands in GPU Instructions","authors":"Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad","doi":"10.1109/LCA.2024.3371062","DOIUrl":"10.1109/LCA.2024.3371062","url":null,"abstract":"GPUs are widely used for diverse applications, particularly data-parallel tasks like machine learning and scientific computing. However, their efficiency is hindered by architectural limitations, inherited from historical RISC processors, in handling memory loads causing high register file contention. We observe that a significant number (around 26%) of values present in the register file are typically used only once, contributing to more than 25% of the total register file bank conflicts, on average. This paper addresses the challenge of single-use memory values in the GPU register file (i.e. data values used only once) which wastes space and increases latency. To this end, we introduce a novel mechanism inspired by CISC architectures. It replaces single-use loads with direct memory operands in arithmetic operations. Our approach improves performance by 20% and reduces energy consumption by 18%, on average, with negligible (<1%) hardware overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"162-165"},"PeriodicalIF":1.4,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140047828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Forward Progress Guarantee in Small Hardware Transactions","authors":"Mahita Nagabhiru;Gregory T. Byrd","doi":"10.1109/LCA.2024.3370992","DOIUrl":"10.1109/LCA.2024.3370992","url":null,"abstract":"Hardware transactional memory (HTM) continues to attract interest from academia and industry alike because of its potential to ease concurrent programming without compromising performance. It offers a simple “all-or-nothing” idea to the programmer, making a piece of code appear atomic in hardware. Despite this and many elegant HTM implementations in research, only best-effort HTM is available commercially. Best-effort HTM lacks a forward progress guarantee, making it harder for the programmer to create a concurrent, scalable fallback path. This has limited HTM's adaptability. To support a myriad of applications, HTMs trade off design and verification complexity against a forward progress guarantee. In this letter, we argue that limiting the scope of applications helps HTM attain guaranteed forward progress. We support lock-free programs by using HTM as multi-word atomics and demonstrate strategic design choices to achieve lock-freedom completely in hardware. We use lfbench, a lock-free micro-benchmark suite, and Arm's best-effort HTM (ARM_TME) on the gem5 simulator as our base. We demonstrate the performance tradeoffs among deferral-based, NACK-based, and NACK-with-backoff approaches. We show that NACK-with-backoff performs better than the others without compromising scalability for both read- and write-intensive applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"53-56"},"PeriodicalIF":2.3,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140007794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FullPack: Full Vector Utilization for Sub-Byte Quantized Matrix-Vector Multiplication on General Purpose CPUs","authors":"Hossein Katebi;Navidreza Asadi;Maziar Goudarzi","doi":"10.1109/LCA.2024.3370402","DOIUrl":"10.1109/LCA.2024.3370402","url":null,"abstract":"Sub-byte quantization on popular vector ISAs suffers from heavy waste of vector as well as memory bandwidth. The latest methods pack a number of quantized data in one vector, but have to pad them with empty bits to avoid overflow to neighbours. We remove even these empty bits and provide full utilization of the vector and memory bandwidth by our data-layout/compute co-design scheme. We implemented FullPack on TFLite for Vector-Matrix multiplication and showed up to 6.7× speedup, 2.75× on average on single layers, which translated to 1.56-2.11× end-to-end speedup on DeepSpeech.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"142-145"},"PeriodicalIF":1.4,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140007796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JANM-IK: Jacobian Argumented Nelder-Mead Algorithm for Inverse Kinematics and its Hardware Acceleration","authors":"Yuxin Yang;Xiaoming Chen;Yinhe Han","doi":"10.1109/LCA.2024.3369940","DOIUrl":"10.1109/LCA.2024.3369940","url":null,"abstract":"Inverse kinematics is one of the core calculations in robotic applications and has strong performance requirements. Previous hardware acceleration work paid little attention to joint constraints, which can lead to computational failures. We propose a new inverse kinematics algorithm JANM-IK. It uses a hardware-friendly design, optimizes the Jacobian-based method and Nelder-Mead method, realizes the processing of joint constraints, and has a high convergence speed. We further designed its acceleration architecture to achieve high-performance computing through sufficient parallelism and hardware optimization. Finally, after experimental verification, JANM-IK can achieve a very high success rate and obtain certain performance improvements.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"45-48"},"PeriodicalIF":2.3,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139979255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Energy-Efficiency of Capsule Networks on Modern GPUs","authors":"Mohammad Hafezan;Ehsan Atoofian","doi":"10.1109/LCA.2024.3365149","DOIUrl":"10.1109/LCA.2024.3365149","url":null,"abstract":"Convolutional neural networks (CNNs) have become the compelling solution in machine learning applications as they surpass human-level accuracy in a certain set of tasks. Despite the success of CNNs, they classify images based on the identification of specific features, ignoring the spatial relationships between different features due to the pooling layer. The capsule network (CapsNet) architecture proposed by Google Brain's team is an attempt to address this drawback by grouping several neurons into a single capsule and learning the spatial correlations between different input features. Thus, the CapsNet identifies not only the presence of a feature but also its relationship with other features. However, the success of the CapsNet comes at the cost of underutilization of resources when it is run on a modern GPU equipped with tensor cores (TCs). Due to the structure of capsules in the CapsNet, functional units in a TC are quite often underutilized, which prolongs the execution of capsule layers and increases energy consumption. In this work, we propose an architecture to eliminate ineffectual operations and improve energy-efficiency of GPUs. Experimental measurements over a set of state-of-the-art datasets show that the proposed approach improves energy-efficiency by 15% while maintaining the accuracy of CapsNets.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"49-52"},"PeriodicalIF":2.3,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139955357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}