Latest Articles in IEEE Computer Architecture Letters

Hardware-Accelerated Kernel-Space Memory Compression Using Intel QAT
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-28 DOI: 10.1109/LCA.2025.3534831
Qirong Xia;Houxiang Ji;Yang Zhou;Nam Sung Kim
{"title":"Hardware-Accelerated Kernel-Space Memory Compression Using Intel QAT","authors":"Qirong Xia;Houxiang Ji;Yang Zhou;Nam Sung Kim","doi":"10.1109/LCA.2025.3534831","DOIUrl":"https://doi.org/10.1109/LCA.2025.3534831","url":null,"abstract":"Data compression has been widely used by datacenters to decrease the consumption of not only the memory and storage capacity but also the interconnect bandwidth. Nonetheless, the CPU cycles consumed for data compression notably contribute to the overall datacenter taxes. To provide a cost-efficient data compression capability for datacenters, Intel has introduced QuickAssist Technology (QAT), a PCIe-attached data-compression accelerator. In this work, we first comprehensively evaluate the compression/decompression performance of the latest <italic>on-chip</i> QAT accelerator and then compare it with that of the previous-generation <italic>off-chip</i> QAT accelerator. Subsequently, as a compelling application for QAT, we take a Linux memory optimization kernel feature: compressed cache for swap pages (<monospace>zswap</monospace>), re-implement it to use QAT efficiently, and then compare the performance of QAT-based <monospace>zswap</monospace> with that of CPU-based <monospace>zswap</monospace>. Our evaluation shows that the deployment of CPU-based <monospace>zswap</monospace> increases the tail latency of a co-running latency-sensitive application, Redis by 3.2-12.1×, while that of QAT-based <monospace>zswap</monospace> does not notably increase the tail latency compared to no deployment of <monospace>zswap</monospace>.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"57-60"},"PeriodicalIF":1.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10856688","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143619076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
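The zswap change described above hinges on one dispatch point: when a page is swapped out, its compression can be handed to the accelerator instead of a CPU codec. Below is a minimal C sketch of that idea; qat_compress_async() and qat_wait() are hypothetical placeholders (the paper's actual QAT integration is not reproduced here), and the stubs exist only so the sketch compiles.

```c
/* Sketch: a zswap-style store path that tries the accelerator first
 * and falls back to a CPU codec. qat_compress_async()/qat_wait() are
 * hypothetical placeholders, not the real QAT driver API. */
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

struct comp_job {
    const void *src;      /* uncompressed page */
    void       *dst;      /* compressed output buffer */
    size_t      dst_len;  /* in: capacity, out: compressed size */
};

/* Stubs so the sketch compiles; real code would talk to the QAT queues. */
static int qat_compress_async(struct comp_job *job) { (void)job; return -1; }
static void qat_wait(struct comp_job *job) { (void)job; }
static size_t cpu_compress(const void *src, void *dst)
{
    memcpy(dst, src, PAGE_SIZE);   /* placeholder codec: copy-through */
    return PAGE_SIZE;
}

/* Store one page: offload if the submission succeeds, else CPU path. */
size_t store_page(const void *page, void *out)
{
    struct comp_job job = { .src = page, .dst = out, .dst_len = PAGE_SIZE };

    if (qat_compress_async(&job) == 0) {
        qat_wait(&job);            /* the CPU can yield meanwhile */
        return job.dst_len;
    }
    return cpu_compress(page, out);
}
```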
Toward Scalable RDMA Through Resource Prefetching
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-27 DOI: 10.1109/LCA.2025.3534188
Zhenlong Ma;Ning Kang;Fan Yang;Chongyang Hong;Jing Xu;Guojun Yuan;Peiheng Zhang;Zhan Wang;Ninghui Sun
{"title":"Toward Scalable RDMA Through Resource Prefetching","authors":"Zhenlong Ma;Ning Kang;Fan Yang;Chongyang Hong;Jing Xu;Guojun Yuan;Peiheng Zhang;Zhan Wang;Ninghui Sun","doi":"10.1109/LCA.2025.3534188","DOIUrl":"https://doi.org/10.1109/LCA.2025.3534188","url":null,"abstract":"RDMA network is being widely deployed in data centers, high-performance computing, and AI clusters. By offloading the network processing protocol stack to hardware, RDMA bypasses the operating system kernel, thereby enabling high performance and low CPU overhead. However, the protocol processing demands substantial communication resources, and due to the limited hardware resources, commercial NICs (Network Interface Cards) experience a significant number of cache misses in large-scale connection scenarios. This results in performance degradation, indicating that RDMA lacks scalability. In this paper, we first analyze the characteristics of resource access in RDMA. Based on these characteristics, we propose a resource access prediction and prefetching mechanism in the hardware, which preemptively fetches the resources required by the protocol processing pipeline to the on-chip cache. This mechanism increases the NIC’s cache hit ratio. Evaluation results demonstrate that our approach improves throughput by 125% and reduces latency by 17.9% under large-scale communication scenarios.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"77-80"},"PeriodicalIF":1.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143706789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
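The mechanism amounts to predicting which connection's context the protocol pipeline will need next and fetching it into the NIC's on-chip cache ahead of demand. A behavioral sketch follows, assuming a simple stride predictor over recent queue-pair numbers; the table size and the prediction rule are illustrative, not the paper's design.

```c
/* Behavioral sketch of NIC-side context prefetching: on each work
 * request, service the demand miss, then prefetch the context the
 * predictor expects next. Sizes and the stride rule are illustrative. */
#include <stdint.h>

#define CTX_CACHE_WAYS 256   /* on-chip context cache entries */

struct qp_ctx { uint32_t qpn; /* ...queue-pair protocol state... */ };

static struct qp_ctx cache[CTX_CACHE_WAYS];
static uint32_t hist[2];     /* two most recent QPNs */

/* Stub: stands in for a DMA of the context from host memory. */
static void fetch_ctx(uint32_t qpn, struct qp_ctx *slot) { slot->qpn = qpn; }

static uint32_t predict_next(void)      /* last-stride prediction */
{
    return hist[0] + (hist[0] - hist[1]);
}

void on_work_request(uint32_t qpn)
{
    struct qp_ctx *slot = &cache[qpn % CTX_CACHE_WAYS];
    if (slot->qpn != qpn)
        fetch_ctx(qpn, slot);           /* demand miss */

    hist[1] = hist[0];
    hist[0] = qpn;

    uint32_t next = predict_next();     /* prefetch predicted context */
    struct qp_ctx *pslot = &cache[next % CTX_CACHE_WAYS];
    if (pslot->qpn != next)
        fetch_ctx(next, pslot);
}
```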
GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-23 DOI: 10.1109/LCA.2025.3533588
Woohyung Choi;Jinwoo Jeong;Hanhwi Jang;Jeongseob Ahn
{"title":"GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip","authors":"Woohyung Choi;Jinwoo Jeong;Hanhwi Jang;Jeongseob Ahn","doi":"10.1109/LCA.2025.3533588","DOIUrl":"https://doi.org/10.1109/LCA.2025.3533588","url":null,"abstract":"This study investigates the performance of serving large language models (LLMs) with a focus on the high-bandwidth interconnect between GPU and CPU using a real NVIDIA Grace Hopper Superchip. This architecture features a GPU-centric memory tiering system, comprising a performance tier with GPU memory and a capacity tier with host memory. We revisit a conventional pipelined execution for LLM inference, utilizing host memory connected via NVLink alongside GPU memory. For the Llama-3.1 8B base (FP16) model, such a GPU-centric tiered memory system meets the target latency requirements for both prefill and decoding while improving throughput compared to the in-memory case, where all model weights are maintained in GPU memory. However, even with NVLink-connected CPU memory, meeting latency constraints for large models like the 70B and 405B FP16 models remains challenging. To address this, we explore the efficacy of model quantization (e.g., AWQ) along with the pipelined execution. Our evaluation reveals that the model quantization makes the pipelined execution a viable solution for serving large models. For the Llama-3.1 70B and 405B AWQ models, we show that the pipelined execution achieves 1.6× and 2.9× throughput improvement, respectively, compared to the in-memory only case, while meeting the latency constraint.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"33-36"},"PeriodicalIF":1.4,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143388529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
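The pipelined execution the abstract revisits overlaps computation of layer i with the NVLink copy of layer i+1's weights from host memory. Here is a generic double-buffered sketch using the CUDA runtime API, not the paper's implementation; run_layer() is an assumed per-layer launcher, stubbed so the sketch compiles, and uniform layer size is assumed for brevity.

```c
/* Double-buffered pipelining: copy layer i+1's weights from host
 * memory while the GPU computes layer i. */
#include <cuda_runtime.h>
#include <stddef.h>

static void run_layer(int i, const float *w, cudaStream_t s)
{
    (void)i; (void)w; (void)s;   /* stub: launch the layer's kernels */
}

void forward_pipelined(float **host_w, size_t w_bytes, int n_layers)
{
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    float *buf[2];               /* double buffer in GPU memory */
    cudaMalloc((void **)&buf[0], w_bytes);
    cudaMalloc((void **)&buf[1], w_bytes);

    cudaMemcpyAsync(buf[0], host_w[0], w_bytes,
                    cudaMemcpyHostToDevice, copy);
    cudaStreamSynchronize(copy);

    for (int i = 0; i < n_layers; i++) {
        if (i + 1 < n_layers)    /* overlap the next copy with compute */
            cudaMemcpyAsync(buf[(i + 1) & 1], host_w[i + 1], w_bytes,
                            cudaMemcpyHostToDevice, copy);
        run_layer(i, buf[i & 1], compute);
        cudaStreamSynchronize(copy);     /* weights for i+1 are ready  */
        cudaStreamSynchronize(compute);  /* buf[i&1] reusable next lap */
    }
    cudaFree(buf[0]); cudaFree(buf[1]);
    cudaStreamDestroy(compute); cudaStreamDestroy(copy);
}
```

Because consecutive layers depend on each other's activations, synchronizing the compute stream each iteration costs nothing; the win comes entirely from hiding the host-to-device copy behind the current layer's kernels.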
2024 Reviewers List
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-22 DOI: 10.1109/LCA.2025.3528619
{"title":"2024 Reviewers List","authors":"","doi":"10.1109/LCA.2025.3528619","DOIUrl":"https://doi.org/10.1109/LCA.2025.3528619","url":null,"abstract":"","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"i-ii"},"PeriodicalIF":1.4,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10849623","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-14 DOI: 10.1109/LCA.2025.3529213
E. Kritheesh;Biswabandan Panda
{"title":"SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack","authors":"E. Kritheesh;Biswabandan Panda","doi":"10.1109/LCA.2025.3529213","DOIUrl":"https://doi.org/10.1109/LCA.2025.3529213","url":null,"abstract":"Last-level cache (LLC) covert-channels exploit the cache timing differences to transmit information. In recent works, the attacks rely on a single sender and a single receiver. Streamline is the state-of-the-art cache covert channel attack that uses a shared array of addresses mapped to the payload bits, allowing parallelization of the encoding and decoding of bits. As multi-core systems are ubiquitous, multiple senders and receivers can be used to create a high bandwidth cache covert channel. However, this is not the case, and the bandwidth per thread is limited by various factors. We extend Streamline to a multi-threaded Streamline, where the senders buffer a few thousand bits at the LLC for the receivers to decode. We observe that these buffered bits are prone to eviction by the co-running processes before they are decoded. We propose SPAM, a multi-threaded covert-channel at the LLC. SPAM shows that fewer but faster senders must encode for more receivers to reduce this time frame. This ensures resilience to noise coming from cache activities of co-running applications. SPAM uses two different access patterns for the sender(s) and the receiver(s). The sender access pattern of the addresses is modified to leverage the hardware prefetchers to accelerate the loads while encoding. The receiver access pattern circumvents the hardware prefetchers for accurate load latency measurements. We demonstrate SPAM on a six-core (12-threaded) system, achieving a bit-rate of 12.21 MB/s at an error rate of 9.02% which is an improvement of over 70% over the state-of-the-art multi-threaded Streamline for comparable error rates when 50% of the co-running threads stress the cache system.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"25-28"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
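The two access patterns the abstract contrasts reduce to a prefetcher-friendly sweep on the sender side and a serialized, timed single load on the receiver side (with probe addresses visited in a non-sequential order so no prefetcher masks the miss latency). A minimal sketch assuming an x86 TSC-based timer; the threshold calibration and address selection of the real attack are not shown.

```c
/* Sender: sequential stride so hardware prefetchers accelerate encoding.
 * Receiver: one timed load; fast access => line cached => decode 1,
 * slow access => LLC miss => decode 0. Threshold is illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE 64

/* Sender: prefetcher-friendly sweep over the shared array. */
void encode_sweep(volatile const uint8_t *arr, size_t n)
{
    for (size_t i = 0; i < n; i += LINE)
        (void)arr[i];
}

/* Receiver: serialized timing of a single probe load. */
int decode_bit(volatile const uint8_t *addr, uint64_t threshold)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);   /* orders against earlier work */
    (void)*addr;
    uint64_t t1 = __rdtscp(&aux);   /* orders against the probe load */
    return (t1 - t0) < threshold;
}
```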
Accelerating Page Migrations in Operating Systems With Intel DSA
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-14 DOI: 10.1109/LCA.2025.3530093
Jongho Baik;Jonghyeon Kim;Chang Hyun Park;Jeongseob Ahn
{"title":"Accelerating Page Migrations in Operating Systems With Intel DSA","authors":"Jongho Baik;Jonghyeon Kim;Chang Hyun Park;Jeongseob Ahn","doi":"10.1109/LCA.2025.3530093","DOIUrl":"https://doi.org/10.1109/LCA.2025.3530093","url":null,"abstract":"Modern server-class CPUs are introducing special-purpose accelerators on the same chip to improve performance and efficiency for data-intensive applications. This paper presents a case for accelerating data migrations in operating systems with the Data Streaming Accelerator (DSA), a new feature by Intel. To the best of our knowledge, this is the first study that exploits a hardware-assisted data migration scheme in the operating system. We identify which Linux kernel components can benefit from the hardware acceleration, particularly focusing on the kernel subsystems that rely on the <monospace>migrate_pages()</monospace> kernel function. As the hardware accelerator is not suitable for transferring a small amount of data due to the HW setup overhead, this preliminary study concentrates on the design and implementation of accelerating <monospace>migrate_pages()</monospace> with DSA. We prototype a DSA-enabled Linux kernel and evaluate its effectiveness through two benchmarks demonstrating real-world page compaction (<monospace>kcompactd</monospace>) and promotion (<monospace>kdamond</monospace>) scenarios. In both cases, our prototype demonstrates improved throughput in page migration, benefiting both the kernel subsystem and applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"37-40"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143388528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
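The design question the abstract raises (hardware setup overhead makes small transfers unprofitable) reduces to a threshold dispatch in the page-copy path. A sketch under that assumption; dsa_submit_batch_copy() and the 8-page threshold are hypothetical, not the kernel's idxd interface or the paper's tuned value.

```c
/* Threshold dispatch inside a migrate_pages()-style copy loop:
 * batch large migrations onto the DSA, keep small ones on the CPU. */
#include <string.h>

#define PAGE_SIZE     4096
#define DSA_MIN_PAGES 8    /* below this, descriptor setup dominates (assumed) */

/* Stub standing in for building and submitting a DSA batch descriptor. */
static int dsa_submit_batch_copy(void **dst, void **src, int n)
{
    (void)dst; (void)src; (void)n;
    return -1;              /* pretend the work queue is unavailable */
}

void copy_pages(void **dst, void **src, int n)
{
    if (n >= DSA_MIN_PAGES && dsa_submit_batch_copy(dst, src, n) == 0)
        return;                          /* offloaded to the DSA */
    for (int i = 0; i < n; i++)          /* CPU fallback path */
        memcpy(dst[i], src[i], PAGE_SIZE);
}
```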
Cooperative Memory Deduplication With Intel Data Streaming Accelerator
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-09 DOI: 10.1109/LCA.2025.3527458
Houxiang Ji;Minho Kim;Seonmu Oh;Daehoon Kim;Nam Sung Kim
{"title":"Cooperative Memory Deduplication With Intel Data Streaming Accelerator","authors":"Houxiang Ji;Minho Kim;Seonmu Oh;Daehoon Kim;Nam Sung Kim","doi":"10.1109/LCA.2025.3527458","DOIUrl":"https://doi.org/10.1109/LCA.2025.3527458","url":null,"abstract":"Memory deduplication plays a critical role in reducing memory consumption and the total cost of ownership (TCO) in hyperscalers, particularly as the advent of large language models imposes unprecedented demands on memory resources. However, conventional CPU-based memory deduplication can interfere with co-running applications, significantly impacting the performance of time-sensitive workloads. Intel introduced the <italic>on-chip</i> Data Streaming Accelerator (DSA), providing high-performance data movement and transformation capabilities, including comparison and checksum calculation, which are heavily utilized in the deduplication. In this work, we enhance a widely-used kernel-space memory deduplication feature, Kernel Samepage Merging (<monospace>ksm</monospace>), by selectively offloading these operations to the DSA. Our evaluation demonstrates that CPU-based <monospace>ksm</monospace> can lead to 5.0–10.9× increase in the tail latency of co-running applications while DSA-based <monospace>ksm</monospace> limits the latency increase to just 1.6× while achieving comparable memory savings.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"29-32"},"PeriodicalIF":1.4,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143105563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
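ksm decides whether two candidate pages can be merged by comparing them byte for byte, which is exactly the operation the letter offloads. A minimal sketch of that dispatch point; dsa_compare() is a hypothetical stand-in for the accelerator's compare descriptor, stubbed here with memcmp so the sketch compiles.

```c
/* Dispatch point for ksm-style dedup: the full-page comparison that
 * decides whether two candidate pages merge. */
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stub: a real implementation would submit a DSA compare and wait. */
static int dsa_compare(const void *a, const void *b, size_t len)
{
    return memcmp(a, b, len);
}

/* Pages merge only if identical; offloading frees the CPU during
 * the full-page scan, which shields co-running tail latency. */
int pages_identical(const void *a, const void *b, int use_dsa)
{
    if (use_dsa)
        return dsa_compare(a, b, PAGE_SIZE) == 0;
    return memcmp(a, b, PAGE_SIZE) == 0;
}
```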
High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2025-01-08 DOI: 10.1109/LCA.2025.3525970
Vardhana M;Rohan Pinto
{"title":"High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network","authors":"Vardhana M;Rohan Pinto","doi":"10.1109/LCA.2025.3525970","DOIUrl":"https://doi.org/10.1109/LCA.2025.3525970","url":null,"abstract":"Convolutional Neural Networks are deployed mostly on GPUs or CPUs. However, due to the increasing complexity of architecture and growing performance requirements, these platforms may not be suitable for deploying inference engines. ASIC and FPGA implementations are appearing as superior alternatives to software-based solutions for achieving the required performance. In this article, an efficient architecture for accelerating convolution using the Winograd transform is proposed and implemented on FPGA. The proposed accelerator consumes 38% less resources as compared with conventional GEMM-based implementation. Analysis results indicate that our accelerator can achieve 3.5 TOP/s, 1.28 TOP/s, and 1.42 TOP/s for VGG16, ResNet18, and MobileNetV2 CNNs, respectively, at 250 MHz. The proposed accelerator demonstrates the best energy efficiency as compared with prior arts.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"21-24"},"PeriodicalIF":1.4,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
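For reference, the arithmetic saving a Winograd-based accelerator exploits: the standard 1-D F(2,3) transform below produces two convolution outputs with four multiplications instead of six, and the 2-D F(2×2,3×3) used in such designs nests the same transform. This is the textbook algorithm, not the paper's FPGA datapath.

```c
/* 1-D Winograd F(2,3): two outputs of a 3-tap convolution from a
 * 4-element input tile, using 4 multiplies instead of 6. */
#include <stdio.h>

void winograd_f2_3(const float d[4], const float g[3], float y[2])
{
    /* filter transform (precomputable once per filter) */
    float gg0 = g[0];
    float gg1 = 0.5f * (g[0] + g[1] + g[2]);
    float gg2 = 0.5f * (g[0] - g[1] + g[2]);
    float gg3 = g[2];

    /* input transform and element-wise products */
    float m1 = (d[0] - d[2]) * gg0;
    float m2 = (d[1] + d[2]) * gg1;
    float m3 = (d[2] - d[1]) * gg2;
    float m4 = (d[1] - d[3]) * gg3;

    /* output transform */
    y[0] = m1 + m2 + m3;   /* = d0*g0 + d1*g1 + d2*g2 */
    y[1] = m2 - m3 - m4;   /* = d1*g0 + d2*g1 + d3*g2 */
}

int main(void)
{
    float d[4] = {1, 2, 3, 4}, g[3] = {1, 0, -1}, y[2];
    winograd_f2_3(d, g, y);
    printf("%g %g\n", y[0], y[1]);   /* direct convolution: -2 -2 */
    return 0;
}
```

At 2-D tile granularity the saving is 2.25× (16 multiplies versus 36 per 2×2 output tile), which is the headroom such accelerators trade against the extra transform adders.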
PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2024-12-25 DOI: 10.1109/LCA.2024.3522777
Sepehr Tabrizchi;Mehrdad Morsali;David Pan;Shaahin Angizi;Arman Roohi
{"title":"PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors","authors":"Sepehr Tabrizchi;Mehrdad Morsali;David Pan;Shaahin Angizi;Arman Roohi","doi":"10.1109/LCA.2024.3522777","DOIUrl":"https://doi.org/10.1109/LCA.2024.3522777","url":null,"abstract":"This letter introduces PINSim, a user-friendly and flexible framework for simulating emerging smart vision sensors in the early design stages. PINSim enables the realization of integrated sensing and processing near and in the sensor, effectively addressing challenges such as data movement and power-hungry analog-to-digital converters. The framework offers a flexible interface and a wide range of design options for customizing the efficiency and accuracy of processing-near/in-sensor-based accelerators using a hierarchical structure. Its organization spans from the device level upward to the algorithm level. PINSim realizes instruction-accurate evaluation of circuit-level performance metrics. PINSim achieves over <inline-formula><tex-math>$25,000times$</tex-math></inline-formula> speed-up compared to SPICE simulation with less than a 4.1% error rate on average. Furthermore, it supports both multilayer perceptron (MLP) and convolutional neural network (CNN) models, with limitations determined by IoT budget constraints. By facilitating the exploration and optimization of various design parameters, PiNSim empowers researchers and engineers to develop energy-efficient and high-performance smart vision sensors for a wide range of applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"17-20"},"PeriodicalIF":1.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
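The device-to-algorithm hierarchy the abstract describes can be pictured as nested configuration layers; the sketch below is purely illustrative (all field names are invented, not PINSim's actual interface).

```c
/* Hypothetical nesting of a PINSim-style configuration, one struct
 * per level of the hierarchy the letter describes. */
struct device_cfg  { double vdd; double r_on_kohm; };           /* device      */
struct circuit_cfg { int adc_bits; int cols_per_adc; };         /* circuit     */
struct arch_cfg    { int array_rows; int array_cols; };         /* architecture */
struct algo_cfg    { enum { MLP, CNN } model; int layers; };    /* algorithm   */

struct pinsim_cfg {
    struct device_cfg  dev;
    struct circuit_cfg cir;
    struct arch_cfg    arch;
    struct algo_cfg    algo;
};
```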
ZoneBuffer: An Efficient Buffer Management Scheme for ZNS SSDs
IF 1.4, CAS Tier 3 (Computer Science)
IEEE Computer Architecture Letters Pub Date: 2024-12-16 DOI: 10.1109/LCA.2024.3498103
Hongtao Wang;Peiquan Jin
{"title":"ZoneBuffer: An Efficient Buffer Management Scheme for ZNS SSDs","authors":"Hongtao Wang;Peiquan Jin","doi":"10.1109/LCA.2024.3498103","DOIUrl":"https://doi.org/10.1109/LCA.2024.3498103","url":null,"abstract":"The introduction of Zoned Namespace SSDs (ZNS SSDs) presents new challenges for existing buffer management schemes. In addition to traditional SSD characteristics such as read/write asymmetry and limited write endurance, ZNS SSDs possess unique constraints, such as requiring sequential writes within each zone. These features make conventional buffering policies incompatible with ZNS SSDs. This paper introduces ZoneBuffer, a novel buffering scheme designed specifically for ZNS SSDs. ZoneBuffer's innovation lies in two key aspects. First, it introduces a new buffer structure comprising a Work Region and a Priority Region. The Priority Region is further divided into a clean page queue and a zone cluster of dirty pages. By confining buffer replacement to the Priority Region, ZoneBuffer ensures optimization for ZNS SSDs. Second, ZoneBuffer incorporates a lifetime-based clustering algorithm to group dirty pages within the Priority Region, optimizing write operations. Preliminary experiments conducted on a real ZNS SSD demonstrate the effectiveness of ZoneBuffer. Compared with conventional schemes like LRU and CFLRU, the results indicate that ZoneBuffer significantly improves performance.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"239-242"},"PeriodicalIF":1.4,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
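A structural sketch of the buffer organization the abstract describes: a Work Region exempt from replacement, plus a Priority Region split into a clean-page queue and per-zone dirty clusters so evicted dirty pages flush as sequential writes. Field names and the victim-selection details are illustrative assumptions, not the paper's exact design.

```c
/* ZoneBuffer-style layout: replacement is confined to the Priority
 * Region, and dirty pages are clustered per zone so a flush stays
 * sequential within the zone. */
#include <stdint.h>

#define MAX_ZONES 1024

struct page_buf {
    uint64_t lpn;                /* logical page number */
    int      dirty;
    struct page_buf *next;
};

struct zone_cluster {            /* dirty pages of one zone, kept in */
    struct page_buf *head;       /* LPN order for sequential flushing */
    uint64_t min_lifetime;       /* lifetime estimate drives grouping */
};

struct zonebuffer {
    struct page_buf *work_region;         /* hot pages: never evicted */
    struct page_buf *clean_queue;         /* Priority Region, part 1 */
    struct zone_cluster zones[MAX_ZONES]; /* Priority Region, part 2 */
};

/* Victim choice: prefer a clean page (dropped for free); otherwise
 * flush one zone's whole dirty cluster as a sequential batch. */
struct page_buf *pick_victim(struct zonebuffer *zb)
{
    if (zb->clean_queue)
        return zb->clean_queue;   /* evict without any device write */
    return zb->zones[0].head;     /* simplified cluster selection */
}
```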