Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris
{"title":"SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving","authors":"Andreas Kosmas Kakolyris;Dimosthenis Masouros;Sotirios Xydis;Dimitrios Soudris","doi":"10.1109/LCA.2024.3406038","DOIUrl":"10.1109/LCA.2024.3406038","url":null,"abstract":"The increasing popularity of LLM-based chatbots combined with their reliance on power-hungry GPU infrastructure forms a critical challenge for providers: minimizing energy consumption under Service-Level Objectives (SLOs) that ensure optimal user experience. Traditional energy optimization methods fall short for LLM inference due to their autoregressive architecture, which renders them incapable of meeting a predefined SLO without \u0000<italic>energy overprovisioning</i>\u0000. This autoregressive nature however, allows for iteration-level adjustments, enabling continuous fine-tuning of the system throughout the inference process. In this letter, we propose a solution based on iteration-level GPU Dynamic Voltage Frequency Scaling (DVFS), aiming to reduce the energy impact of LLM serving, an approach that has the potential for more than 22.8% and up to 45.5% energy gains when tested in real world situations under varying SLO constraints. Our approach works on top of existing LLM hosting services, requires minimal profiling and no intervention to the inference serving system.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"150-153"},"PeriodicalIF":1.4,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141188853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongmo Park;Subhankar Pal;Aporva Amarnath;Karthik Swaminathan;Wei D. Lu;Alper Buyuktosunoglu;Pradip Bose
{"title":"Dramaton: A Near-DRAM Accelerator for Large Number Theoretic Transforms","authors":"Yongmo Park;Subhankar Pal;Aporva Amarnath;Karthik Swaminathan;Wei D. Lu;Alper Buyuktosunoglu;Pradip Bose","doi":"10.1109/LCA.2024.3381452","DOIUrl":"10.1109/LCA.2024.3381452","url":null,"abstract":"With the rising popularity of post-quantum cryptographic schemes, realizing practical implementations for real-world applications is still a major challenge. A major bottleneck in such schemes is the fetching and processing of large polynomials in the Number Theoretic Transform (NTT), which makes non Von Neumann paradigms, such as near-memory processing, a viable option. We, therefore, propose a novel near-DRAM NTT accelerator design, called \u0000<sc>Dramaton</small>\u0000. Additionally, we introduce a conflict-free mapping algorithm that enables \u0000<sc>Dramaton</small>\u0000 to process large NTTs with minimal hardware overhead using a fixed-permutation network. \u0000<sc>Dramaton</small>\u0000 achieves 5–207× speedup in latency over the state-of-the-art and 97× improvement in EDP over a recent near-memory NTT accelerator.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"108-111"},"PeriodicalIF":2.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erika S. Alcorta;Mahesh Madhav;Richard Afoakwa;Scott Tetrick;Neeraja J. Yadwadkar;Andreas Gerstlauer
{"title":"Characterizing Machine Learning-Based Runtime Prefetcher Selection","authors":"Erika S. Alcorta;Mahesh Madhav;Richard Afoakwa;Scott Tetrick;Neeraja J. Yadwadkar;Andreas Gerstlauer","doi":"10.1109/LCA.2024.3404887","DOIUrl":"10.1109/LCA.2024.3404887","url":null,"abstract":"Modern computer designs support composite prefetching, where multiple prefetcher components are used to target different memory access patterns. However, multiple prefetchers competing for resources can sometimes hurt performance, especially in many-core systems where cache and other resources are limited. Recent work has proposed mitigating this issue by selectively enabling and disabling prefetcher components at runtime. Formulating the problem with machine learning (ML) methods is promising, but efficient and effective solutions in terms of cost and performance are not well understood. This work studies fundamental characteristics of the composite prefetcher selection problem through the lens of ML to inform future prefetcher selection designs. We show that prefetcher decisions do not have significant temporal dependencies, that a phase-based rather than sample-based definition of ground truth yields patterns that are easier to learn, and that prefetcher selection can be formulated as a workload-agnostic problem requiring little to no training at runtime.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"146-149"},"PeriodicalIF":1.4,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141171208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyungyo Kim;Gaohan Ye;Nachuan Wang;Amir Yazdanbakhsh;Nam Sung Kim
{"title":"Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference","authors":"Hyungyo Kim;Gaohan Ye;Nachuan Wang;Amir Yazdanbakhsh;Nam Sung Kim","doi":"10.1109/LCA.2024.3397747","DOIUrl":"10.1109/LCA.2024.3397747","url":null,"abstract":"The ever-increasing number of parameters in Large Language Models (LLMs) demands many expensive GPUs for both inference and training. This is because even such a high-end GPU such as NVIDIA A100 can store only a subset of parameters due to its limited memory capacity. To reduce the number of required GPUs, especially for inference, we may exploit the large memory capacity of (host) CPU to store not only all the model parameters but also intermediate outputs which also require a substantial memory capacity. However, this necessitates frequent data transfers between CPU and GPU over the slow PCIe interface, creating a bottleneck that hinders the accomplishment of both low latency and high throughput in inference. To address such a challenge, we first propose CPU-GPU cooperative computing that exploits the Advanced Matrix Extensions (AMX) capability of the latest Intel CPU, codenamed Sapphire Rapids (SPR). Second, we propose an adaptive model partitioning policy that determines the layers of a given LLM to be run on CPU and GPU, respectively, based on their memory capacity requirement and arithmetic intensity. As CPU executes the layers with large memory capacity but low arithmetic intensity, the amount of data transferred through the PCIe interface is significantly reduced, thereby improving the LLM inference performance. Our evaluation demonstrates that CPU-GPU cooperative computing, based on this policy, delivers 12.1× lower latency and 5.4× higher throughput than GPU-only computing for OPT-30B inference when both CPU-GPU and GPU-only computing store the model in CPU memory.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"117-120"},"PeriodicalIF":2.3,"publicationDate":"2024-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10538369","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141146488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI","authors":"Mrinmay Sasmal;Tresa Joseph;Bindiya T. S.","doi":"10.1109/LCA.2024.3379002","DOIUrl":"10.1109/LCA.2024.3379002","url":null,"abstract":"This letter introduces an innovative approximate multiplier (AM) architecture that leverages stochastically generated bit streams through the Linear Feedback Shift Register (LFSR). The AM is applied to matrix-vector multiplication (MVM) in Neural Networks (NNs). The hardware implementations in 90 nm CMOS technology demonstrate superior power and area efficiency compared to state-of-the-art designs. Additionally, the study explores applying stochastic computing to LSTM NNs, showcasing improved energy efficiency and speed.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"91-94"},"PeriodicalIF":2.3,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140169772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hashing ATD Tags for Low-Overhead Safe Contention Monitoring","authors":"Pablo Andreu;Pedro Lopez;Carles Hernandez","doi":"10.1109/LCA.2024.3401570","DOIUrl":"10.1109/LCA.2024.3401570","url":null,"abstract":"Increasing the performance of safety-critical systems via introducing multicore processors is becoming the norm. However, when multiple cores access a shared cache, inter-core evictions become a relevant source of interference that must be appropriately controlled. To solve this issue, one can statically partition caches and remove the interference. Unfortunately, this comes at the expense of less flexibility and, in some cases, worse performance. In this context, enabling more flexible cache allocation policies requires additional monitoring support. This paper proposes HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG enables a low-overhead implementation of an Auxiliary Tag Directory to determine inter-core evictions. Our results show that no inter-task interference underprediction is possible with HashTAG while providing a 44% reduction in ATD area with only 1.14% median overprediction.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"166-169"},"PeriodicalIF":1.4,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10530895","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changmin Shin;Taehee Kwon;Jaeyong Song;Jae Hyung Ju;Frank Liu;Yeonkyu Choi;Jinho Lee
{"title":"A Case for In-Memory Random Scatter-Gather for Fast Graph Processing","authors":"Changmin Shin;Taehee Kwon;Jaeyong Song;Jae Hyung Ju;Frank Liu;Yeonkyu Choi;Jinho Lee","doi":"10.1109/LCA.2024.3376680","DOIUrl":"10.1109/LCA.2024.3376680","url":null,"abstract":"Because of the widely recognized memory wall issue, modern DRAMs are increasingly being assigned innovative functionalities beyond the basic read and write operations. Often referred to as “function-in-memory”, these techniques are crafted to leverage the abundant internal bandwidth available within the DRAM. However, these techniques face several challenges, including requiring large areas for arithmetic units and the necessity of splitting a single word into multiple pieces. These challenges severely limit the practical application of these function-in-memory techniques. In this paper, we present Piccolo, an efficient design of random scatter-gather memory. Our method achieves significant improvements with minimal overhead. By demonstrating our technique on a graph processing accelerator, we show that Piccolo and the proposed accelerator achieves \u0000<inline-formula><tex-math>$1.2-3.1 times$</tex-math></inline-formula>\u0000 speedup compared to the prior art.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"73-77"},"PeriodicalIF":2.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140124987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information","authors":"Saurav Maji;Kyungmi Lee;Anantha P. Chandrakasan","doi":"10.1109/LCA.2024.3397730","DOIUrl":"10.1109/LCA.2024.3397730","url":null,"abstract":"This letter explores security vulnerabilities in sparsity-aware optimizations for Neural Network (NN) platforms, specifically focusing on timing side-channel attacks introduced by optimizations such as skipping sparse multiplications. We propose a classification prediction attack that utilizes this timing side-channel information to mimic the NN's prediction outcomes. Our techniques were demonstrated for CIFAR-10, MNIST, and biomedical classification tasks using diverse dataflows and processing loads in timing models. The demonstrated results could predict the original classification decision with high accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"133-136"},"PeriodicalIF":2.3,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140925787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry
{"title":"Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management","authors":"Deepanjali Mishra;Konstantinos Kanellopoulos;Ashish Panwar;Akshitha Sriraman;Vivek Seshadri;Onur Mutlu;Todd C. Mowry","doi":"10.1109/LCA.2024.3373760","DOIUrl":"10.1109/LCA.2024.3373760","url":null,"abstract":"In recent decades, software systems have grown significantly in size and complexity. As a result, such systems are more prone to bugs which can cause performance and correctness challenges. Using run-time monitoring tools is one approach to mitigate these challenges. However, these tools maintain metadata for every byte of application data they monitor, which precipitates performance overheads from additional metadata accesses. We propose \u0000<italic>Address Scaling</i>\u0000, a new hardware framework that performs fine-grained metadata management to reduce metadata access overheads in run-time monitoring tools. Our mechanism is based on the observation that different run-time monitoring tools maintain metadata at varied granularities. Our key insight is to maintain the data and its corresponding metadata within the same cache line, to preserve locality. \u0000<italic>Address Scaling</i>\u0000 improves the performance of \u0000<monospace>Memcheck</monospace>\u0000, a dynamic monitoring tool that detects memory-related errors, by 3.55× and 6.58× for sequential and random memory access patterns respectively, compared to the state-of-the-art systems that store the metadata in a memory region that is separate from the data.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"69-72"},"PeriodicalIF":2.3,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140057254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad
{"title":"Exploiting Direct Memory Operands in GPU Instructions","authors":"Ali Mohammadpur-Fard;Sina Darabi;Hajar Falahati;Negin Mahani;Hamid Sarbazi-Azad","doi":"10.1109/LCA.2024.3371062","DOIUrl":"10.1109/LCA.2024.3371062","url":null,"abstract":"GPUs are widely used for diverse applications, particularly data-parallel tasks like machine learning and scientific computing. However, their efficiency is hindered by architectural limitations, inherited from historical RISC processors, in handling memory loads causing high register file contention. We observe that a significant number (around 26%) of values present in the register file are typically used only once, contributing to more than 25% of the total register file bank conflicts, on average. This paper addresses the challenge of single-use memory values in the GPU register file (i.e. data values used only once) which wastes space and increases latency. To this end, we introduce a novel mechanism inspired by CISC architectures. It replaces single-use loads with direct memory operands in arithmetic operations. Our approach improves performance by 20% and reduces energy consumption by 18%, on average, with negligible (<1%) hardware overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"162-165"},"PeriodicalIF":1.4,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140047828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}