Heng Cao;Zhipeng Wu;Dejian Li;Peiguang Jing;Sio Hang Pun;Yu Liu
{"title":"Accelerating Control Flow on CGRAs via Speculative Iteration Execution","authors":"Heng Cao;Zhipeng Wu;Dejian Li;Peiguang Jing;Sio Hang Pun;Yu Liu","doi":"10.1109/LCA.2025.3554777","DOIUrl":"https://doi.org/10.1109/LCA.2025.3554777","url":null,"abstract":"Coarse-Grained Reconfigurable Arrays (CGRAs) offer a promising architecture for accelerating general-purpose, compute-intensive tasks. However, handling control flow within these tasks remains a challenge for CGRAs. Current methods for handling control flow in CGRAs execute condition operations before selecting branch paths, which adds extra execution time. This article proposes a CGRA architecture that decouples the control flow condition and path selection within an iteration through speculative iteration execution (SIE), where the condition is predicted before the start of the current iteration. Compared to existing methods, the SIE CGRA achieves a geometric mean speedup of <inline-formula><tex-math>$1.31times$</tex-math> </inline-formula> over Partial Predication, <inline-formula><tex-math>$1.17 times$</tex-math> </inline-formula> over Dynamic-II Pipeline and <inline-formula><tex-math>$1.12times$</tex-math> </inline-formula> over Dual-Issue Single-Execution.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"109-112"},"PeriodicalIF":1.4,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Intel AMX Power Gating","authors":"Joshua Kalyanapu;Farshad Dizani;Azam Ghanbari;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3555183","DOIUrl":"https://doi.org/10.1109/LCA.2025.3555183","url":null,"abstract":"We identify a novel vulnerability in Intel AMX’s dynamic power performance scaling, enabling <sc>NetLoki</small>, a stealthy and high-performance remote speculative attack that bypasses traditional cache defenses and leaks arbitrary addresses over a realistic network where other attacks fail. <sc>NetLoki</small> shows a 34,900% improvement in leakage rate over NetSpectre. We show that <sc>NetLoki</small> evades detection by three state-of-the-art microarchitectural attack detectors (EVAX, PerSpectron, RHMD) and requires a 20,000x reduction in the system’s timer resolution (10 us) than the standard 0.5 ns hardware timer to be mitigated via timer coarsening. Finally, we analyze the root cause of the leakage and propose an effective defense. We show that the mitigation increases CPU power consumption by<monospace> 12.33%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"113-116"},"PeriodicalIF":1.4,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chihun Song;Michael Jaemin Kim;Yan Sun;Houxiang Ji;Kyungsan Kim;TaeKyeong Ko;Jung Ho Ahn;Nam Sung Kim
{"title":"X-PPR: Post Package Repair for CXL Memory","authors":"Chihun Song;Michael Jaemin Kim;Yan Sun;Houxiang Ji;Kyungsan Kim;TaeKyeong Ko;Jung Ho Ahn;Nam Sung Kim","doi":"10.1109/LCA.2025.3552190","DOIUrl":"https://doi.org/10.1109/LCA.2025.3552190","url":null,"abstract":"CXL is an emerging interface that can cost-efficiently expand the capacity and bandwidth of servers, recycling DRAM modules from retired servers. Such DRAM modules, however, will likely have many uncorrectable faulty words due to years of strenuous use in datacenters. To repair faulty words in the field, a few solutions based on Post Package Repair (PPR) and memory offlining have been proposed. Nonetheless, they are either unable to fix thousands of faulty words or prone to causing severe memory fragmentation, as they operate at the granularity of DRAM row and memory page addresses, respectively. In this work, for cost-efficient use of recycled DRAM modules with thousands of faulty words, we propose C<u>X</u>L-<u>PPR</u> (X-PPR), exploiting the CXL’s support for near-memory processing and variable memory access latency. We demonstrate that X-PPR implemented in a commercial CXL device with DDR4 DRAM modules can handle a faulty bit probability that is <inline-formula><tex-math>$3.3 times 10^{4}$</tex-math></inline-formula> higher than ECC for a 512GB DRAM module. Meanwhile, X-PPR negligibly degrades the performance of popular memory-intensive benchmarks, which is achieved through two mechanisms designed in X-PPR to minimize the performance impact of additional DRAM accesses required for repairing faulty words.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"97-100"},"PeriodicalIF":1.4,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDT: Cutting Datacenter Tax Through Simultaneous Data-Delivery Threads","authors":"Amin Mamandipoor;Huy Dinh Tran;Mohammad Alian","doi":"10.1109/LCA.2025.3549423","DOIUrl":"https://doi.org/10.1109/LCA.2025.3549423","url":null,"abstract":"Networking is considered a datacenter tax, and hyperscalers push hard to provide high-performance networking with minimal resource expenditure. To keep up with the ever-increasing network rates, many CPU cycles are spent on the networking tax. We make a key observation that network processing threads can be simultaneously executed on server CPUs with minimal interference with the application threads. However, utilizing simultaneous multithreading (SMT) to scale the number of network threads with the number of application threads suffers from (1) failing to provide strict tail latency requirements for latency-critical applications, and (2) reducing the number of available hardware threads for application processes, thus contributing to a high datacenter network tax. In this work, we design, implement, and evaluate a chip-multiprocessor (CMP) with specialized Simultaneous Data-delivery Threads (SDT) per physical core. The key insight is that with judicious partitioning at the architectural level, SDT can safely co-run with application processes with guaranteed performance isolation. Our evaluation results, using full-system simulation, show that a 20-core CMP enhanced with SDT reduces the area and power consumption of a baseline 40-core CMP by 47.5% and 66%, respectively, while reducing network throughput by less than 10%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"93-96"},"PeriodicalIF":1.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minseok Seo;Jungi Hyun;Seongho Jeong;Xuan Truong Nguyen;Hyuk-Jae Lee;Hyokeun Lee
{"title":"OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems","authors":"Minseok Seo;Jungi Hyun;Seongho Jeong;Xuan Truong Nguyen;Hyuk-Jae Lee;Hyokeun Lee","doi":"10.1109/LCA.2025.3567844","DOIUrl":"https://doi.org/10.1109/LCA.2025.3567844","url":null,"abstract":"The key-value (KV) cache in large language models (LLMs) now necessitates a substantial amount of memory capacity as its size proportionally grows with the context’s size. Recently, Compute-Express Link (CXL) memory becomes a promising method to secure memory capacity. However, CXL memory in a GPU-based LLM inference platform entails performance and scalability challenges due to the limited bandwidth of CXL memory. This paper proposes OASIS, an outlier-aware KV cache clustering for scaling LLM inference in CXL memory systems. Our method is based on the observation that clustering is effective in trading off between performance and accuracy compared to previous quantization- or selection-based approaches if clustering is aware of outliers. Our evaluation shows OASIS yields 3.6× speedup compared to the case without clustering while preserving accuracy with just 5% of full KV cache.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"165-168"},"PeriodicalIF":1.4,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Pattern-Driven LUT for Efficient In-Cache Computing in CNNs Acceleration","authors":"Zhengpan Fei;Mingchuan Lyu;Satoshi Kawakami;Koji Inoue","doi":"10.1109/LCA.2025.3548080","DOIUrl":"https://doi.org/10.1109/LCA.2025.3548080","url":null,"abstract":"The lookup table (LUT)-based Processing-in-Memory (PIM) solutions perform computations by looking up precomputed results stored in LUTs, providing exceptional efficiency for complex operations such as multiplication, making them highly suitable for energy- and latency-efficient Convolutional Neural Network (CNN) inference tasks. However, including all possible results in the LUT naively demands exponential hardware resources, significantly limiting parallelism and increasing hardware area, latency, and power overhead. While decomposition and compression techniques can reduce the LUT size, they also introduce considerable memory access overhead and additional operations. To address these challenges, we conduct an extensive analysis to identify which data portions significantly impact accuracy in CNNs. Based on the insight that key data is concentrated in a small range, we propose a data-pattern-driven (DPD) optimization strategy, which approximates less critical data to drastically reduce LUT size while preserving computational efficiency with acceptable accuracy loss.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"81-84"},"PeriodicalIF":1.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Vector Permutation Instruction Execution via Controllable Bitonic Network","authors":"Shabirahmed Badashasab Jigalur;Daniel Jiménez Mazure;Teresa Cervero Garcia;Yen-Cheng Kuan","doi":"10.1109/LCA.2025.3548527","DOIUrl":"https://doi.org/10.1109/LCA.2025.3548527","url":null,"abstract":"High-performance computing applications rely heavily on vector instructions to accelerate data processing. In this letter, we propose a controllable bitonic network (CBN) and use it as a lane interconnect to efficiently rearrange data across vector lanes of a vector processing unit to accelerate the execution of vector permutation instructions (VPIs). Our work focuses on the RISC-V vector instruction set because of its configurable vector length support. Through simulations with vector-permutation-intensive applications of a RISC-V vector benchmark suite (RiVEC), the proposed approach with an eight-lane 64-bit CBN demonstrates an average speedup of ≥6× regarding the VPI execution time over a conventional ring-network-based approach. In addition, to verify our approach on hardware, we implemented a processor system with an eight-lane 16-bit CBN on an AMD A7-100T FPGA operating at 20 MHz, demonstrating single-cycle execution of the RISC-V <italic>vr.gather</i> and <italic>vr.scatter</i> instructions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"133-136"},"PeriodicalIF":1.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DPWatch: A Framework for Hardware-Based Differential Privacy Guarantees","authors":"Pawan Kumar Sanjaya;Christina Giannoula;Ian Colbert;Ihab Amer;Mehdi Saeedi;Gabor Sines;Nandita Vijaykumar","doi":"10.1109/LCA.2025.3547262","DOIUrl":"https://doi.org/10.1109/LCA.2025.3547262","url":null,"abstract":"Differential privacy (DP) and federated learning (FL) have emerged as important privacy-preserving approaches when using sensitive data to train machine learning models. FL ensures that raw sensitive data does not leave the users’ devices by training the model in a distributed manner. DP ensures that the model does not leak any information about an individual by <italic>clipping</i> and adding <italic>noise</i> to the gradients. However, real-life deployments of such algorithms assume that the third-party application implementing DP-based FL is trusted, and is thus given access to sensitive data on the data owner’s device/server. In this work, we propose DPWatch, a hardware-based framework for ML accelerators that enforces guarantees that a third party application cannot leak sensitive user data used for training and ensures that the gradients are appropriately noised before leaving the device. We evaluate DPWatch on two accelerators and demonstrate small area and performance overheads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"89-92"},"PeriodicalIF":1.4,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143761420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management","authors":"Mattia Tibaldi;Christian Pilato","doi":"10.1109/LCA.2025.3566553","DOIUrl":"https://doi.org/10.1109/LCA.2025.3566553","url":null,"abstract":"To reduce emerging carbon emissions in cloud computing, we proposed Amethyst, a new VM placement and migration strategy capable of adapting consumption to the currently available green energy. Amethyst tackles the problem on three fronts: it adjusts the consumption to energy production, optimizes execution on FPGA accelerators, and balances execution among servers. We evaluate the strategy with real workloads. Our simulations on CloudSim Plus show that Amethyst effectively reduces the carbon emissions of cloud computing and, compared to the state-of-the-art, it increases the energy efficiency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"153-156"},"PeriodicalIF":1.4,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit","authors":"Taehun Kim;Yunjae Lee;Juntaek Lim;Minsoo Rhu","doi":"10.1109/LCA.2025.3546811","DOIUrl":"https://doi.org/10.1109/LCA.2025.3546811","url":null,"abstract":"Recommendation systems are crucial for personalizing user experiences on online platforms. While Deep Learning Recommendation Models (DLRMs) have been the state-of-the-art for nearly a decade, their scalability is limited, as model quality scales poorly with compute. Recently, there have been research efforts applying Transformer architecture to recommendation systems, and Hierarchical Sequential Transaction Unit (HSTU), an encoder architecture, has been proposed to address scalability challenges. Although HSTU-based generative recommenders show significant potential, they have received little attention from computer architects. In this paper, we analyze the inference process of HSTU-based generative recommenders and perform an in-depth characterization of the model. Our findings indicate the attention mechanism is a major performance bottleneck. We further discuss promising research directions and optimization strategies that can potentially enhance the efficiency of HSTU models.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"85-88"},"PeriodicalIF":1.4,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}