IEEE Computer Architecture Letters最新文献_第2页

Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD 竞争感知GPU线程块调度高效GPU- ssd

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-07-07 DOI: 10.1109/LCA.2025.3586312

Xueyang Liu;Seonjin Na;Euijun Chung;Jiashen Cao;Jing Yang;Hyesoon Kim

引用次数: 0

HPN-SpGEMM: Hybrid PIM-NMP for SpGEMM

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-27 DOI: 10.1109/LCA.2025.3583758

Kwangrae Kim;Ki-Seok Chung

引用次数: 0

SAFE: Sharing-Aware Prefetching for Efficient GPU Memory Management With Unified Virtual Memory 安全：共享感知预取高效GPU内存管理与统一的虚拟内存

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-24 DOI: 10.1109/LCA.2025.3553143

Hyunkyun Shin;Seongtae Bang;Hyungwon Park;Daehoon Kim

{"title":"SAFE: Sharing-Aware Prefetching for Efficient GPU Memory Management With Unified Virtual Memory","authors":"Hyunkyun Shin;Seongtae Bang;Hyungwon Park;Daehoon Kim","doi":"10.1109/LCA.2025.3553143","DOIUrl":"https://doi.org/10.1109/LCA.2025.3553143","url":null,"abstract":"As the demand for GPU memory from applications such as machine learning continues to grow exponentially, maximizing GPU memory capacity has become increasingly important. Unified Virtual Memory (UVM), which combines host and GPU memory into a unified address space, allows GPUs to utilize more memory than their physical capacity. However, this advantage comes at the cost of significant overheads when accessing host memory. Although existing prefetching techniques help alleviate these overheads, they still encounter challenges when dealing with irregular workloads and dynamic mixed workloads. In this paper, we demonstrate that the regularity of workloads is strongly correlated with the sharing status of UVM memory blocks among the Streaming Multiprocessors (SMs) of GPUs, which in turn impacts the effectiveness of prefetching. In addition, we propose the <bold>Sharing <bold>Aware pre<bold>FEtching technique, <monospace>SAFE</monospace>, which dynamically adjusts prefetching strategies based on the sharing status of the accessed memory blocks. <monospace>SAFE</monospace> efficiently tracks the sharing status of the memory blocks by leveraging unified TLBs (uTLBs) and enforces tailored prefetching configurations for each block. This approach requires no hardware modifications and incurs negligible performance overhead. Our evaluation shows that <monospace>SAFE</monospace> achieves up to a 6.5× performance improvement over UVM default prefetcher for workloads with predominantly irregular memory access patterns, with an average improvement of 3.6×.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"117-120"},"PeriodicalIF":1.4,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144472587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation 提示：主机内网卡流量和智能网卡仿真的硬件平台

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-23 DOI: 10.1109/LCA.2025.3582481

Jiaqi Lou;Yu Li;Srikar Vanavasam;Nam Sung Kim

{"title":"HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation","authors":"Jiaqi Lou;Yu Li;Srikar Vanavasam;Nam Sung Kim","doi":"10.1109/LCA.2025.3582481","DOIUrl":"https://doi.org/10.1109/LCA.2025.3582481","url":null,"abstract":"Recent performance advancements in inter-host networking demand innovations in intra-host communication and SmartNIC-accelerated in-network processing. However, developing novel SmartNIC features remains difficult due to absence of hardware observability and low-cost, deterministic testing environments with existing software-based or commercial development platforms. While FPGA-based SmartNICs offer high flexibility and performance for packet processing acceleration, existing solutions support only a limited subset of network technologies widely used in commercial datacenters. To address these challenges, we introduce HINT, an FPGA-based development and emulation platform that transparently mimics a commercial SmartNIC in the system, featuring controlled network traffic generation with a high-performance traffic engine and kernel-bypass network technologies. It also supports configurable workload patterns, nanosecond-level latency measurement, and a reconfigurable Receive Side Scaling (RSS) engine for load balancing. Our evaluation shows that HINT achieves 91% of PCIe’s theoretical efficiency, providing a highly effective and scalable platform to emulate an end-to-end system with support for diverse network stacks. HINT thus establishes an accessible, high-fidelity platform for SmartNIC development and emulation, along with architectural exploration of intra-host communication.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"261-264"},"PeriodicalIF":1.4,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11048525","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144880525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Time Series Machine Learning Models for Precise SSD Access Latency Prediction 用于SSD访问延迟精确预测的时间序列机器学习模型

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-20 DOI: 10.1109/LCA.2025.3581580

Bikrant Das Sharma;Houxiang Ji;Ipoom Jeong;Nam Sung Kim

{"title":"Time Series Machine Learning Models for Precise SSD Access Latency Prediction","authors":"Bikrant Das Sharma;Houxiang Ji;Ipoom Jeong;Nam Sung Kim","doi":"10.1109/LCA.2025.3581580","DOIUrl":"https://doi.org/10.1109/LCA.2025.3581580","url":null,"abstract":"Solid State Drives (SSDs) have become the dominant storage solution over the past few years. A key component of SSDs is the controller, which manages communication between the host and flash memory, optimizing data transfer speeds, integrity, and lifespan. However, modern SSDs function as closed boxes, as manufacturers do not disclose firmware and controller details. Meanwhile, read and write latencies are affected by various internal optimizations, such as wear-leveling and garbage collection, making precise latency prediction challenging. Existing approaches rely on trace-driven simulation or machine learning, but either (1) just classify operations into broad latency categories (e.g., fast or slow), including software stack overhead, or (2) make imprecise predictions while consuming significant system resources and time. For system simulation, latency predictions must be both fast and accurate, focusing solely on device-level delays excluding OS overhead, which is modeled separately. To tackle these challenges, this paper presents time series machine learning models to accurately predict hardware-only SSD latencies across diverse workloads. Our evaluation shows that the proposed model predicts 85%–95% of individual I/O latencies within a 10% error margin, outperforming existing simulators and ML models, which achieve only 6%–37% accuracy, while also providing 4×–255× speedups in prediction latency.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"233-236"},"PeriodicalIF":1.4,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144814153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage GPU统一存储上张量迁移的内存超订阅感知调度

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-17 DOI: 10.1109/LCA.2025.3580264

Junsu Kim;Jaebeom Jeon;Jaeyong Park;Sangun Choi;Minseong Gil;Seokin Hong;Gunjae Koo;Myung Kuk Yoon;Yunho Oh

{"title":"MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage","authors":"Junsu Kim;Jaebeom Jeon;Jaeyong Park;Sangun Choi;Minseong Gil;Seokin Hong;Gunjae Koo;Myung Kuk Yoon;Yunho Oh","doi":"10.1109/LCA.2025.3580264","DOIUrl":"https://doi.org/10.1109/LCA.2025.3580264","url":null,"abstract":"Deep Neural Network (DNN) training demands large memory capacities that exceed the limits of current GPU onboard memory. Expanding GPU memory with SSDs is a cost-effective approach. However, the low bandwidth of SSDs introduces severe performance bottlenecks in data management, particularly for Unified Virtual Memory (UVM)-based systems. The default on-demand migration mechanism in UVM causes frequent page faults and stalls, exacerbated by memory oversubscription and eviction processes along the critical path. To address these challenges, this paper proposes Memory Oversubscription-aware Scheduling for Tensor Migration (MOST), a software framework designed to improve data migration in UVM environments. MOST profiles memory access behavior and quantifies the impact of memory oversubscription stalls and schedules tensor migrations to minimize overall training time. With the profiling results, MOST executes newly designed pre-eviction and prefetching instructions within DNN kernel code. MOST effectively selects and migrates tensors that can mitigate memory oversubscription stalls, thus reducing training time. Our evaluation shows that MOST achieves an average speedup of 22.9% and 12.8% over state-of-the-art techniques, DeepUM and G10, respectively.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"213-216"},"PeriodicalIF":1.4,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144680906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Stardust: Scalable and Transferable Workload Mapping for Large AI on Multi-Chiplet Systems 星尘：多芯片系统上大型人工智能的可扩展和可转移工作负载映射

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-17 DOI: 10.1109/LCA.2025.3580562

Wencheng Zou;Feiyun Zhao;Nan Wu

引用次数: 0

pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing pNet-gem5：通过并行网络数据包处理实现高性能网络的全系统仿真

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-06 DOI: 10.1109/LCA.2025.3577232

Jongmin Shin;Seongtae Bang;Gyeongseo Park;Daehoon Kim

{"title":"pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing","authors":"Jongmin Shin;Seongtae Bang;Gyeongseo Park;Daehoon Kim","doi":"10.1109/LCA.2025.3577232","DOIUrl":"https://doi.org/10.1109/LCA.2025.3577232","url":null,"abstract":"Modern server processors in data centers equipped with high-performance networking technologies (e.g., 100 Gigabit Ethernet) commonly support parallel packet processing via multi-queue NICs, enabling multiple cores to efficiently handle massive traffic loads. However, existing architectural simulators such as <monospace>gem5</monospace> lack support for these techniques and suffer from limited bandwidth due to outdated networking models. Although a recent study introduced a simulation framework supporting userspace high-performance networking via the Data Plane Development Kit (DPDK), many applications still rely on kernel-based networking. To address these limitations, we present <monospace>pNet-gem5</monospace>, a full-system simulation framework designed to model server systems under high-performance network workloads, targeting data center architecture research. <monospace>pNet-gem5</monospace> extends <monospace>gem5</monospace> by supporting parallel packet processing on multi-core systems through the integration of multiple hardware queues and a more advanced interrupt mechanism—Message Signaled Interrupts (MSI)—which allows each NIC queue to be mapped to a dedicated core with its own IRQ. It also provides a high-performance network interface and device driver that support scalable and configurable packet distribution between hardware and software. Moreover, by decoupling packet distribution and scheduling from NIC core logic, <monospace>pNet-gem5</monospace> enables flexible experimentation with custom policies. As a result, <monospace>pNet-gem5</monospace> enables more realistic simulation of modern server environments by modeling multi-queue NICs and supporting bandwidths up to 46 Gbps—a significant improvement over the previous limit of only a few Gbps and more closely aligned with today’s tens-of-Gbps networks.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"193-196"},"PeriodicalIF":1.4,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

The Architectural Sustainability Indicator 建筑可持续发展指标

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-06-05 DOI: 10.1109/LCA.2025.3576891

Jaime Roelandts;Ajeya Naithani;Lieven Eeckhout

引用次数: 0

WoperTM: Got Nacks? Use Them! 有零食吗？使用它们!

IF 1.4 3区计算机科学

IEEE Computer Architecture Letters Pub Date : 2025-04-28 DOI: 10.1109/LCA.2025.3565199

Víctor Nicolás-Conesa;Rubén Titos-Gil;Ricardo Fernández-Pascual;Manuel E. Acacio;Alberto Ros

{"title":"WoperTM: Got Nacks? Use Them!","authors":"Víctor Nicolás-Conesa;Rubén Titos-Gil;Ricardo Fernández-Pascual;Manuel E. Acacio;Alberto Ros","doi":"10.1109/LCA.2025.3565199","DOIUrl":"https://doi.org/10.1109/LCA.2025.3565199","url":null,"abstract":"The simplicity of requester-wins has made it the preferred choice for conflict resolution in commercial implementations of Hardware Transactional Memory (HTM), which typically have relied on conventional locking to escape from conflict-induced livelocks. Prior work advocates for combining requester-wins and requester-loses to ensure progress for higher-priority transactions, yet it fails to take full advantage of the available features, namely, protocol support for <italic>nacks. This paper introduces WoperTM, a dual-policy, best-effort HTM design that resolves conflicts using <italic>requester-loses policy in the common case. Our key insight is that, since <italic>nacks are required to support priorities in HTM, performance can be improved at nearly no extra cost by allowing regular transactions to benefit from requester-loses, instead of only those involving a high-priority transaction. Experimental results using gem5 and STAMP show that WoperTM can significantly reduce squashed work and improve execution times by 12% with respect to <italic>power transactions, with negligible hardware overhead.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"157-160"},"PeriodicalIF":1.4,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0