Daniel Puckett;Tyler Tomer;Paul V. Gratz;Jiang Hu;Galen Shipman;Jered Dominguez-Trujillo;Kevin Sheridan
{"title":"Estimating CPI Stacks From Multiplexed Performance Counter Data Using Machine Learning","authors":"Daniel Puckett;Tyler Tomer;Paul V. Gratz;Jiang Hu;Galen Shipman;Jered Dominguez-Trujillo;Kevin Sheridan","doi":"10.1109/LCA.2025.3556644","DOIUrl":"https://doi.org/10.1109/LCA.2025.3556644","url":null,"abstract":"Optimizing software at runtime is much easier with a clear understanding of the bottlenecks facing the software. CPI stacks are a common method of visualizing these bottlenecks. However, existing proposals to implement CPI stacks require hardware modifications. To compute CPI stacks without modifying the CPU, we demonstrate CPI stacks can be estimated from existing performance counters using machine learning.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"129-132"},"PeriodicalIF":1.4,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heng Cao;Zhipeng Wu;Dejian Li;Peiguang Jing;Sio Hang Pun;Yu Liu
{"title":"Accelerating Control Flow on CGRAs via Speculative Iteration Execution","authors":"Heng Cao;Zhipeng Wu;Dejian Li;Peiguang Jing;Sio Hang Pun;Yu Liu","doi":"10.1109/LCA.2025.3554777","DOIUrl":"https://doi.org/10.1109/LCA.2025.3554777","url":null,"abstract":"Coarse-Grained Reconfigurable Arrays (CGRAs) offer a promising architecture for accelerating general-purpose, compute-intensive tasks. However, handling control flow within these tasks remains a challenge for CGRAs. Current methods for handling control flow in CGRAs execute condition operations before selecting branch paths, which adds extra execution time. This article proposes a CGRA architecture that decouples the control flow condition and path selection within an iteration through speculative iteration execution (SIE), where the condition is predicted before the start of the current iteration. Compared to existing methods, the SIE CGRA achieves a geometric mean speedup of <inline-formula><tex-math>$1.31\\times$</tex-math></inline-formula> over Partial Predication, <inline-formula><tex-math>$1.17\\times$</tex-math></inline-formula> over Dynamic-II Pipeline and <inline-formula><tex-math>$1.12\\times$</tex-math></inline-formula> over Dual-Issue Single-Execution.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"109-112"},"PeriodicalIF":1.4,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate SFQ-Based Computing Architecture Modeling With Device-Level Guidelines","authors":"Pratiksha Mundhe;Yuta Hano;Satoshi Kawakami;Teruo Tanimoto;Masamitsu Tanaka;Koji Inoue;Ilkwon Byun","doi":"10.1109/LCA.2025.3573740","DOIUrl":"https://doi.org/10.1109/LCA.2025.3573740","url":null,"abstract":"Single-flux-quantum (SFQ) logic has emerged as a promising post-Moore technology thanks to its ultra-fast and low-energy operation. However, despite progress in various fields, its feasibility is questionable due to the prohibitive cooling cost. Proven conventional ideas, such as approximate computing, may help to resolve this challenge. However, introducing such ideas has been impossible due to the complex performance, power, and error trade-offs originating from the unique SFQ device characteristics. This work introduces approximate SFQ-based computing (AxSFQ) with an architecture modeling framework and essential design guidelines. Our optimized device-level AxSFQ showcases 30–100 times energy efficiency improvement, which motivates further circuit and architecture-level exploration.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"253-256"},"PeriodicalIF":1.4,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Intel AMX Power Gating","authors":"Joshua Kalyanapu;Farshad Dizani;Azam Ghanbari;Darsh Asher;Samira Mirbagher Ajorpaz","doi":"10.1109/LCA.2025.3555183","DOIUrl":"https://doi.org/10.1109/LCA.2025.3555183","url":null,"abstract":"We identify a novel vulnerability in Intel AMX’s dynamic power performance scaling, enabling NetLoki, a stealthy and high-performance remote speculative attack that bypasses traditional cache defenses and leaks arbitrary addresses over a realistic network where other attacks fail. NetLoki shows a 34,900% improvement in leakage rate over NetSpectre. We show that NetLoki evades detection by three state-of-the-art microarchitectural attack detectors (EVAX, PerSpectron, RHMD) and that mitigating it via timer coarsening requires a timer resolution of 10 µs, 20,000× coarser than the standard 0.5 ns hardware timer. Finally, we analyze the root cause of the leakage and propose an effective defense. We show that the mitigation increases CPU power consumption by 12.33%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"113-116"},"PeriodicalIF":1.4,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chihun Song;Michael Jaemin Kim;Yan Sun;Houxiang Ji;Kyungsan Kim;TaeKyeong Ko;Jung Ho Ahn;Nam Sung Kim
{"title":"X-PPR: Post Package Repair for CXL Memory","authors":"Chihun Song;Michael Jaemin Kim;Yan Sun;Houxiang Ji;Kyungsan Kim;TaeKyeong Ko;Jung Ho Ahn;Nam Sung Kim","doi":"10.1109/LCA.2025.3552190","DOIUrl":"https://doi.org/10.1109/LCA.2025.3552190","url":null,"abstract":"CXL is an emerging interface that can cost-efficiently expand the capacity and bandwidth of servers, recycling DRAM modules from retired servers. Such DRAM modules, however, will likely have many uncorrectable faulty words due to years of strenuous use in datacenters. To repair faulty words in the field, a few solutions based on Post Package Repair (PPR) and memory offlining have been proposed. Nonetheless, they are either unable to fix thousands of faulty words or prone to causing severe memory fragmentation, as they operate at the granularity of DRAM row and memory page addresses, respectively. In this work, for cost-efficient use of recycled DRAM modules with thousands of faulty words, we propose CXL-PPR (X-PPR), exploiting CXL’s support for near-memory processing and variable memory access latency. We demonstrate that X-PPR implemented in a commercial CXL device with DDR4 DRAM modules can handle a faulty bit probability that is <inline-formula><tex-math>$3.3 \\times 10^{4}$</tex-math></inline-formula> higher than ECC for a 512 GB DRAM module. Meanwhile, X-PPR negligibly degrades the performance of popular memory-intensive benchmarks, which is achieved through two mechanisms designed in X-PPR to minimize the performance impact of additional DRAM accesses required for repairing faulty words.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"97-100"},"PeriodicalIF":1.4,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeongho Lee;Sangjun Kim;Jaeyong Lee;Jaeyoung Kang;Sungjin Lee;Nam Sung Kim;Jihong Kim
{"title":"srNAND: A Novel NAND Flash Organization for Enhanced Small Read Throughput in SSDs","authors":"Jeongho Lee;Sangjun Kim;Jaeyong Lee;Jaeyoung Kang;Sungjin Lee;Nam Sung Kim;Jihong Kim","doi":"10.1109/LCA.2025.3571321","DOIUrl":"https://doi.org/10.1109/LCA.2025.3571321","url":null,"abstract":"Emerging data-intensive applications with frequent small random read operations challenge the throughput capabilities of conventional SSD architectures. Although Compute Express Link (CXL)-enabled SSDs allow for fine-grained data access with reduced latency, their read throughput remains limited by legacy block-oriented designs. To address this, we propose <inline-formula><tex-math>${\\sf srNAND}$</tex-math></inline-formula>, an advanced NAND flash architecture for CXL SSDs. It uses a two-stage ECC decoding mechanism to reduce read amplification, an optimized read command sequence to boost parallelism, and a request merging module to eliminate redundant operations. Our evaluation shows that <inline-formula><tex-math>${\\sf srSSD}$</tex-math></inline-formula> can improve read throughput by up to 10.4× compared to conventional CXL SSDs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"197-200"},"PeriodicalIF":1.4,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DynaFlow: An ML Framework for Dynamic Dataflow Selection in SpGEMM Accelerators","authors":"Sanjali Yadav;Bahar Asgari","doi":"10.1109/LCA.2025.3570667","DOIUrl":"https://doi.org/10.1109/LCA.2025.3570667","url":null,"abstract":"Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning, leveraging matrix sparsity to reduce both storage and computation costs. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Existing hardware accelerators often employ fixed dataflows designed for specific sparsity patterns, leading to performance degradation when the input deviates from these assumptions. As SpGEMM adoption expands across a broad spectrum of sparsity workloads, the demand grows for accelerators capable of dynamically adapting their dataflow schemes to diverse sparsity patterns. To address this, we propose DynaFlow, a machine learning-based framework that trains on the set of dataflows supported by any given accelerator and learns to predict the optimal dataflow based on the input sparsity pattern. By leveraging decision trees and deep reinforcement learning, DynaFlow surpasses static dataflow selection approaches, achieving up to a 50× speedup.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"189-192"},"PeriodicalIF":1.4,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144205869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search","authors":"Seoyoung Ko;Hyunjeong Shim;Wanju Doh;Sungmin Yun;Jinin So;Yongsuk Kwon;Sang-Soo Park;Si-Dong Roh;Minyong Yoon;Taeksang Song;Jung Ho Ahn","doi":"10.1109/LCA.2025.3570235","DOIUrl":"https://doi.org/10.1109/LCA.2025.3570235","url":null,"abstract":"Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72× higher throughput than the baseline CXL system and 2.35× over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"173-176"},"PeriodicalIF":1.4,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144196731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimal Counters, Maximum Insight: Simplifying System Performance With HPC Clusters for Optimized Monitoring","authors":"Shubhi Shukla;Abhijeet Singh;Rajdeep Chakraborty;Anirban Chakraborty;Tejas Rathod;Harshal Mumbaikar;Manoj Kumar Munigala;Madhusudhan K N;Pabitra Mitra;Debdeep Mukhopadhyay","doi":"10.1109/LCA.2025.3570157","DOIUrl":"https://doi.org/10.1109/LCA.2025.3570157","url":null,"abstract":"As computer systems become more complex, evaluating performance requires tracking various hardware performance counters that capture the system’s internal activities. While these counters provide valuable insights, their growing number makes it challenging to identify the most relevant ones for performance analysis. In this paper, we investigate the correlation between performance counter values and overall system performance, while also exploring the inter-correlation between different counters. Our findings demonstrate that specific counters are strongly correlated with key performance metrics and that significant redundancy exists among counters. By leveraging these relationships, we propose a method for selecting a small, representative set of performance counters. This streamlined set can further be used to accurately predict performance score across various workloads and system configurations.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"177-180"},"PeriodicalIF":1.4,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144196730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDT: Cutting Datacenter Tax Through Simultaneous Data-Delivery Threads","authors":"Amin Mamandipoor;Huy Dinh Tran;Mohammad Alian","doi":"10.1109/LCA.2025.3549423","DOIUrl":"https://doi.org/10.1109/LCA.2025.3549423","url":null,"abstract":"Networking is considered a datacenter tax, and hyperscalers push hard to provide high-performance networking with minimal resource expenditure. To keep up with the ever-increasing network rates, many CPU cycles are spent on the networking tax. We make a key observation that network processing threads can be simultaneously executed on server CPUs with minimal interference with the application threads. However, utilizing simultaneous multithreading (SMT) to scale the number of network threads with the number of application threads suffers from (1) failing to provide strict tail latency requirements for latency-critical applications, and (2) reducing the number of available hardware threads for application processes, thus contributing to a high datacenter network tax. In this work, we design, implement, and evaluate a chip-multiprocessor (CMP) with specialized Simultaneous Data-delivery Threads (SDT) per physical core. The key insight is that with judicious partitioning at the architectural level, SDT can safely co-run with application processes with guaranteed performance isolation. Our evaluation results, using full-system simulation, show that a 20-core CMP enhanced with SDT reduces the area and power consumption of a baseline 40-core CMP by 47.5% and 66%, respectively, while reducing network throughput by less than 10%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"93-96"},"PeriodicalIF":1.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}