{"title":"Editorial: A Letter From the Editor-in-Chief of IEEE Computer Architecture Letters","authors":"Sudhanva Gurumurthi;Mattan Erez","doi":"10.1109/LCA.2025.3528276","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528276","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"iii-iv"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856691","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack","authors":"E. Kritheesh;Biswabandan Panda","doi":"10.1109/LCA.2025.3529213","DOIUrl":"https://doi.org/10.1109/LCA.2025.3529213","url":null,"abstract":"Last-level cache (LLC) covert-channels exploit the cache timing differences to transmit information. In recent works, the attacks rely on a single sender and a single receiver. Streamline is the state-of-the-art cache covert channel attack that uses a shared array of addresses mapped to the payload bits, allowing parallelization of the encoding and decoding of bits. As multi-core systems are ubiquitous, multiple senders and receivers can be used to create a high bandwidth cache covert channel. However, this is not the case, and the bandwidth per thread is limited by various factors. We extend Streamline to a multi-threaded Streamline, where the senders buffer a few thousand bits at the LLC for the receivers to decode. We observe that these buffered bits are prone to eviction by the co-running processes before they are decoded. We propose SPAM, a multi-threaded covert-channel at the LLC. SPAM shows that fewer but faster senders must encode for more receivers to reduce this time frame. This ensures resilience to noise coming from cache activities of co-running applications. SPAM uses two different access patterns for the sender(s) and the receiver(s). The sender access pattern of the addresses is modified to leverage the hardware prefetchers to accelerate the loads while encoding. The receiver access pattern circumvents the hardware prefetchers for accurate load latency measurements. We demonstrate SPAM on a six-core (12-threaded) system, achieving a bit-rate of 12.21 MB/s at an error rate of 9.02% which is an improvement of over 70% over the state-of-the-art multi-threaded Streamline for comparable error rates when 50% of the co-running threads stress the cache system.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"25-28"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooperative Memory Deduplication With Intel Data Streaming Accelerator","authors":"Houxiang Ji;Minho Kim;Seonmu Oh;Daehoon Kim;Nam Sung Kim","doi":"10.1109/LCA.2025.3527458","DOIUrl":"https://doi.org/10.1109/LCA.2025.3527458","url":null,"abstract":"Memory deduplication plays a critical role in reducing memory consumption and the total cost of ownership (TCO) in hyperscalers, particularly as the advent of large language models imposes unprecedented demands on memory resources. However, conventional CPU-based memory deduplication can interfere with co-running applications, significantly impacting the performance of time-sensitive workloads. Intel introduced the <italic>on-chip</italic> Data Streaming Accelerator (DSA), providing high-performance data movement and transformation capabilities, including comparison and checksum calculation, which are heavily utilized in the deduplication. In this work, we enhance a widely-used kernel-space memory deduplication feature, Kernel Samepage Merging (<monospace>ksm</monospace>), by selectively offloading these operations to the DSA. Our evaluation demonstrates that CPU-based <monospace>ksm</monospace> can lead to 5.0–10.9× increase in the tail latency of co-running applications while DSA-based <monospace>ksm</monospace> limits the latency increase to just 1.6× while achieving comparable memory savings.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"29-32"},"PeriodicalIF":1.4,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network","authors":"Vardhana M;Rohan Pinto","doi":"10.1109/LCA.2025.3525970","DOIUrl":"https://doi.org/10.1109/LCA.2025.3525970","url":null,"abstract":"Convolutional Neural Networks are deployed mostly on GPUs or CPUs. However, due to the increasing complexity of architecture and growing performance requirements, these platforms may not be suitable for deploying inference engines. ASIC and FPGA implementations are appearing as superior alternatives to software-based solutions for achieving the required performance. In this article, an efficient architecture for accelerating convolution using the Winograd transform is proposed and implemented on FPGA. The proposed accelerator consumes 38% less resources as compared with conventional GEMM-based implementation. Analysis results indicate that our accelerator can achieve 3.5 TOP/s, 1.28 TOP/s, and 1.42 TOP/s for VGG16, ResNet18, and MobileNetV2 CNNs, respectively, at 250 MHz. The proposed accelerator demonstrates the best energy efficiency as compared with prior arts.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"21-24"},"PeriodicalIF":1.4,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors","authors":"Sepehr Tabrizchi;Mehrdad Morsali;David Pan;Shaahin Angizi;Arman Roohi","doi":"10.1109/LCA.2024.3522777","DOIUrl":"https://doi.org/10.1109/LCA.2024.3522777","url":null,"abstract":"This letter introduces PINSim, a user-friendly and flexible framework for simulating emerging smart vision sensors in the early design stages. PINSim enables the realization of integrated sensing and processing near and in the sensor, effectively addressing challenges such as data movement and power-hungry analog-to-digital converters. The framework offers a flexible interface and a wide range of design options for customizing the efficiency and accuracy of processing-near/in-sensor-based accelerators using a hierarchical structure. Its organization spans from the device level upward to the algorithm level. PINSim realizes instruction-accurate evaluation of circuit-level performance metrics. PINSim achieves over <inline-formula><tex-math>$25,000\\times$</tex-math></inline-formula> speed-up compared to SPICE simulation with less than a 4.1% error rate on average. Furthermore, it supports both multilayer perceptron (MLP) and convolutional neural network (CNN) models, with limitations determined by IoT budget constraints. By facilitating the exploration and optimization of various design parameters, PINSim empowers researchers and engineers to develop energy-efficient and high-performance smart vision sensors for a wide range of applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"17-20"},"PeriodicalIF":1.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZoneBuffer: An Efficient Buffer Management Scheme for ZNS SSDs","authors":"Hongtao Wang;Peiquan Jin","doi":"10.1109/LCA.2024.3498103","DOIUrl":"https://doi.org/10.1109/LCA.2024.3498103","url":null,"abstract":"The introduction of Zoned Namespace SSDs (ZNS SSDs) presents new challenges for existing buffer management schemes. In addition to traditional SSD characteristics such as read/write asymmetry and limited write endurance, ZNS SSDs possess unique constraints, such as requiring sequential writes within each zone. These features make conventional buffering policies incompatible with ZNS SSDs. This paper introduces ZoneBuffer, a novel buffering scheme designed specifically for ZNS SSDs. ZoneBuffer's innovation lies in two key aspects. First, it introduces a new buffer structure comprising a Work Region and a Priority Region. The Priority Region is further divided into a clean page queue and a zone cluster of dirty pages. By confining buffer replacement to the Priority Region, ZoneBuffer ensures optimization for ZNS SSDs. Second, ZoneBuffer incorporates a lifetime-based clustering algorithm to group dirty pages within the Priority Region, optimizing write operations. Preliminary experiments conducted on a real ZNS SSD demonstrate the effectiveness of ZoneBuffer. Compared with conventional schemes like LRU and CFLRU, the results indicate that ZoneBuffer significantly improves performance.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"239-242"},"PeriodicalIF":1.4,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Straw: A Stress-Aware WL-Based Read Reclaim Technique for High-Density NAND Flash-Based SSDs","authors":"Myoungjun Chun;Jaeyong Lee;Inhyuk Choi;Jisung Park;Myungsuk Kim;Jihong Kim","doi":"10.1109/LCA.2024.3516205","DOIUrl":"https://doi.org/10.1109/LCA.2024.3516205","url":null,"abstract":"Although read disturbance has emerged as a major reliability concern, managing read disturbance in modern NAND flash memory has not been thoroughly investigated yet. From a device characterization study using real modern NAND flash memory, we observe that reading a page incurs heterogeneous reliability impacts on each WL, which makes the existing block-level read reclaim extremely inefficient. We propose a new WL-level read-reclaim technique, called <sc>Straw</sc>, which keeps track of the accumulated read-disturbance effect on each WL and reclaims only heavily-disturbed WLs. By avoiding unnecessary read-reclaim operations, <sc>Straw</sc> reduces read-reclaim-induced page writes by 83.6% with negligible storage overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"5-8"},"PeriodicalIF":1.4,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Electra: Eliminating the Ineffectual Computations on Bitmap Compressed Matrices","authors":"Chaithanya Krishna Vadlamudi;Bahar Asgari","doi":"10.1109/LCA.2024.3516057","DOIUrl":"https://doi.org/10.1109/LCA.2024.3516057","url":null,"abstract":"The primary computations in several applications, such as deep learning recommendation models, graph neural networks, and scientific computing, involve sparse matrix sparse matrix multiplications (SpMSpM). Unlike standard multiplications, SpMSpMs introduce ineffective computations that can negatively impact performance. While several accelerators have been proposed to execute SpMSpM more efficiently, they often incur additional overhead in identifying the effectual arithmetic computations. To solve this issue, we propose Electra, a novel approach designed to reduce ineffectual computations in bitmap-compressed matrices. Electra achieves this by i) performing logical operations on the bitmap data to know whether the arithmetic computation has a zero or non-zero value, and ii) implementing finer granular scheduling of non-zero elements to arithmetic units. Our evaluations suggest that on average, Electra achieves a speedup of 1.27× over the state-of-the-art SpMSpM accelerator with a small area overhead of 64.92 <inline-formula><tex-math>$\\text{mm}^{2}$</tex-math></inline-formula> based on 45 nm process.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"9-12"},"PeriodicalIF":1.4,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IntervalSim++: Enhanced Interval Simulation for Unbalanced Processor Designs","authors":"Haseung Bong;Nahyeon Kang;Youngsok Kim;Joonsung Kim;Hanhwi Jang","doi":"10.1109/LCA.2024.3514917","DOIUrl":"https://doi.org/10.1109/LCA.2024.3514917","url":null,"abstract":"As processor microarchitecture is getting complicated, an accurate analytic model becomes crucial for exploring large processor design space within limited development time. An interval simulation is a widely used analytic model for processor designs in the early stage. However, it cannot accurately model modern microarchitecture, which has an <italic>unbalanced</italic> pipeline. In this work, we introduce IntervalSim++, an accurate analytic model for a modern microarchitecture design based on the interval simulation. We identify key components highly related to the unbalanced pipeline and propose new modeling techniques atop the interval simulation without incurring significant overheads. Our evaluations show IntervalSim++ accurately models a modern out-of-order processor with minimal overheads, showing 1% average CPI error and only 8.8% simulation time increase compared to the baseline interval simulation.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"1-4"},"PeriodicalIF":1.4,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}