IEEE Computer Architecture Letters: Latest Articles

WoperTM: Got Nacks? Use Them!
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-28 DOI: 10.1109/LCA.2025.3565199
Víctor Nicolás-Conesa;Rubén Titos-Gil;Ricardo Fernández-Pascual;Manuel E. Acacio;Alberto Ros
Abstract: The simplicity of requester-wins has made it the preferred choice for conflict resolution in commercial implementations of Hardware Transactional Memory (HTM), which typically have relied on conventional locking to escape from conflict-induced livelocks. Prior work advocates for combining requester-wins and requester-loses to ensure progress for higher-priority transactions, yet it fails to take full advantage of the available features, namely, protocol support for nacks. This paper introduces WoperTM, a dual-policy, best-effort HTM design that resolves conflicts using requester-loses policy in the common case. Our key insight is that, since nacks are required to support priorities in HTM, performance can be improved at nearly no extra cost by allowing regular transactions to benefit from requester-loses, instead of only those involving a high-priority transaction. Experimental results using gem5 and STAMP show that WoperTM can significantly reduce squashed work and improve execution times by 12% with respect to power transactions, with negligible hardware overhead.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 157-160. Journal Article.
Citations: 0
Cache and Near-Data Co-Design for Chiplets
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-25 DOI: 10.1109/LCA.2025.3564535
Arteen Abrishami;Zhengrong Wang;Tony Nowatzki
Abstract: Vendors are increasingly adopting chiplet-based designs to manage cost for large-scale multi-cores. While near-data computing, a paradigm involving offloading computation near where data is located in memory, has been studied in the context of monolithic chip designs, its applications to chiplets remain unexplored. In this letter, we explore how the paradigm extends to chiplets in a system where computation is offloaded to accelerators collocated within the last-level-cache structure. We explore both shared and private last-level-cache designs across a variety of different workloads, both large-scale graph computations and more regular-access workloads, in order to understand how to optimize the cache and topology design for near-data workloads. We find that with a mesh chiplet architecture with shared last-level-cache (LLC), near-data optimization can achieve an 8.70× speedup on graph workloads, providing an even greater benefit than in traditional systems.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 149-152. Journal Article.
Citations: 0
In-Memory Computing Accelerator for Iterative Linear Algebra Solvers
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-22 DOI: 10.1109/LCA.2025.3563365
Rui Liu;Zerun Li;Xiaoyu Zhang;Xiaoming Chen;Yinhe Han;Minghua Tang
Abstract: Iterative linear solvers are a crucial kernel in many numerical analysis problems. The performance and energy efficiency of iterative solvers based on traditional architectures are severely constrained by the memory wall bottleneck. Computing-in-memory (CIM) has the potential to enhance solving efficiency. Existing CIM architectures are mostly customized for specific algorithms and primarily focus on handling fixed-point operations, which makes it difficult for them to meet the demands of diverse and high-precision applications. In this work, we propose a CIM architecture that natively supports various iterative linear solvers based on floating-point operations. We develop a new instruction set for the accelerator, which can be flexibly combined to implement various iterative solvers. The evaluation results show that, compared with the GPU implementation, our accelerator achieves more than 10.1× speedup and 6.8× energy savings when executing different iterative solvers.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 161-164. Journal Article.
Citations: 0
Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-21 DOI: 10.1109/LCA.2025.3563105
Aalaa M.A. Babai;Koji Inoue
Abstract: Low-power volatile FPGAs (VFPGAs) naturally meet the intertwined processing and flexibility demands of IoT devices. However, as IoT devices shift toward Energy Harvesting (EH) for self-sustained operation, VFPGAs are overlooked because they struggle under harvested power. Their volatile SRAM configuration memory cells frequently lose their data, causing high reconfiguration penalties. These penalties grow with FPGAs’ resource usage, limiting it under EH. Still, advances in low-power FPGAs and energy-buffering systems’ efficiency motivate us to explore EH-powered FPGAs. Thus, we analyze the interplay of their resources, performance, and reconfiguration; simulate their operation under different EH conditions; and show how they can be utilized up to an application- and EH-dependent threshold.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 137-140. Journal Article.
Citations: 0
MixDiT: Accelerating Image Diffusion Transformer Inference With Mixed-Precision MX Quantization
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-15 DOI: 10.1109/LCA.2025.3560786
Daeun Kim;Jinwoo Hwang;Changhun Oh;Jongse Park
Abstract: Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produces mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10–5.32× over RTX 3090, with no loss in FID.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 141-144. Journal Article.
Citations: 0
L-DTC: Load-based Dynamic Throughput Control for Guaranteed I/O Performance in Virtualized Environments
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-10 DOI: 10.1109/LCA.2025.3559454
TaeHoon Kim;Jaechun No
Abstract: In this letter, we identify an issue where the I/O performance of specific tasks cannot be guaranteed during multi-process I/O operations, despite the use of the latest storage technologies in virtualized environments. To address this issue, we propose L-DTC, a novel Load-based Dynamic Throughput Control technique, designed to achieve guaranteed I/O performance in virtualized environments. Operating at the hypervisor level, L-DTC provides fine-grained throughput control based on I/O queues and ensures independent I/O performance for each process by allowing users to define maximum and minimum throughput levels for each queue. We conducted an evaluation of L-DTC and confirmed that it successfully guarantees the I/O performance requirements of specific processes in multi-process environments. Furthermore, L-DTC achieved more stable I/O performance compared to existing methods, with improvements in I/O performance of up to 2.1 times, regardless of the I/O scheduler.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 145-148. Journal Article.
Citations: 0
Pyramid: Accelerating LLM Inference With Cross-Level Processing-in-Memory
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-10 DOI: 10.1109/LCA.2025.3559738
Liang Yan;Xiaoyang Lu;Xiaoming Chen;Yinhe Han;Xian-He Sun
Abstract: Integrating processing-in-memory (PIM) with GPUs accelerates large language model (LLM) inference, but existing GPU-PIM systems encounter several challenges. While GPUs excel in large general matrix-matrix multiplications (GEMM), they struggle with small-scale operations better suited for PIM, which currently cannot handle them independently. Additionally, the computational demands of activation operations exceed the capabilities of current PIM technologies, leading to excessive data movement between the GPU and memory. PIM's potential for general matrix-vector multiplications (GEMV) is also limited by insufficient support for fine-grained parallelism. To address these issues, we propose Pyramid, a novel GPU-PIM system that optimizes PIM for LLM inference by strategically allocating cross-level computational resources within PIM to meet diverse needs and leveraging the strengths of both technologies. Evaluation results demonstrate that Pyramid outperforms existing systems like NeuPIM, AiM, and AttAcc by factors of 2.31×, 1.91×, and 1.72×, respectively.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 121-124. Journal Article.
Citations: 0
Memory-Centric MCM-GPU Architecture
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-09 DOI: 10.1109/LCA.2025.3553766
Hossein SeyyedAghaei;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout
Abstract: The demand for powerful GPUs continues to grow, driven by modern-day applications that require ever increasing computational power and memory bandwidth. Multi-Chip Module (MCM) GPUs provide the scalability potential by integrating GPU chiplets on an interposer substrate; however, they are hindered by their GPU-centric design, i.e., off-chip GPU bandwidth is statically (at design time) allocated to local versus remote memory accesses. This paper presents the memory-centric MCM-GPU architecture. By connecting the HBM stacks on the interposer, rather than the GPUs, and by connecting the GPUs to bridges on the interposer network, the full off-chip GPU bandwidth can be dynamically allocated to local and remote memory accesses. Preliminary results demonstrate the potential of the memory-centric architecture offering an average 1.36× (and up to 1.90×) performance improvement over a GPU-centric architecture.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 101-104. Journal Article.
Citations: 0
Analyzing and Exploiting Memory Hierarchy Parallelism With MLP Stacks
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-08 DOI: 10.1109/LCA.2025.3558808
Adnan Hasnat;Wim Heirman;Shoaib Akram
Abstract: Obtaining high instruction throughput on modern CPUs requires generating a high degree of memory-level parallelism (MLP). MLP is typically reported as a quantitative metric at the DRAM level. However, understanding the reasons that hinder memory parallelism requires more insightful metrics and visualizations. This paper proposes a new taxonomy of MLP metrics, splitting MLP into core and prefetch components and measuring both miss and hit cache level parallelism. Our key contribution is an MLP stack, a visualization that integrates these metrics and connects them to performance by showing the CPI contribution of each memory level. The stack also shows speculative parallelism from dependency-bound and structural-hazard-bound loads. We implement the MLP stack in a processor simulator and conduct case studies that demonstrate the potential for targeting software optimizations (e.g., software prefetching) and hardware improvements (e.g., instruction window sizing).
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 125-128. Journal Article.
Citations: 0
Estimating CPI Stacks From Multiplexed Performance Counter Data Using Machine Learning
IF 1.4, Zone 3, Computer Science
IEEE Computer Architecture Letters Pub Date : 2025-04-01 DOI: 10.1109/LCA.2025.3556644
Daniel Puckett;Tyler Tomer;Paul V. Gratz;Jiang Hu;Galen Shipman;Jered Dominguez-Trujillo;Kevin Sheridan
Abstract: Optimizing software at runtime is much easier with a clear understanding of the bottlenecks facing the software. CPI stacks are a common method of visualizing these bottlenecks. However, existing proposals to implement CPI stacks require hardware modifications. To compute CPI stacks without modifying the CPU, we demonstrate CPI stacks can be estimated from existing performance counters using machine learning.
IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 129-132. Journal Article.
Citations: 0