{"title":"Cache and Near-Data Co-Design for Chiplets","authors":"Arteen Abrishami;Zhengrong Wang;Tony Nowatzki","doi":"10.1109/LCA.2025.3564535","DOIUrl":"https://doi.org/10.1109/LCA.2025.3564535","url":null,"abstract":"Vendors are increasingly adopting chiplet-based designs to manage cost for large-scale multi-cores. While near-data computing, a paradigm involving offloading computation near where data is located in memory, has been studied in the context of monolithic chip designs – its applications to chiplets remain unexplored. In this letter, we explore how the paradigm extends to chiplets in a system where computation is offloaded to accelerators collocated within the last-level-cache structure. We explore both shared and private last-level-cache designs across a variety of different workloads, both large-scale graph computations and more regular-access workloads, in order to understand how to optimize the cache and topology design for near-data workloads. We find that with a mesh chiplet architecture with shared last-level-cache (LLC), near-data optimization can achieve an 8.70× speedup on graph workloads, providing an even greater benefit than in traditional systems.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"149-152"},"PeriodicalIF":1.4,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Memory Computing Accelerator for Iterative Linear Algebra Solvers","authors":"Rui Liu;Zerun Li;Xiaoyu Zhang;Xiaoming Chen;Yinhe Han;Minghua Tang","doi":"10.1109/LCA.2025.3563365","DOIUrl":"https://doi.org/10.1109/LCA.2025.3563365","url":null,"abstract":"Iterative linear solvers are a crucial kernel in many numerical analysis problems. The performance and energy efficiency of iterative solvers based on traditional architectures are severely constrained by the memory wall bottleneck. Computing-in-memory (CIM) has the potential to enhance solving efficiency. Existing CIM architectures are mostly customized for specific algorithms and primarily focus on handling fixed-point operations, which makes them difficult to meet the demands of diverse and high-precision applications. In this work, we propose a CIM architecture that natively supports various iterative linear solvers based on floating-point operations. We develop a new instruction set for the accelerator, which can be flexibly combined to implement various iterative solvers. The evaluation results show that, compared with the GPU implementation, our accelerator achieves more than 10.1× speedup and 6.8× energy savings when executing different iterative solvers.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"161-164"},"PeriodicalIF":1.4,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications","authors":"Aalaa M.A. Babai;Koji Inoue","doi":"10.1109/LCA.2025.3563105","DOIUrl":"https://doi.org/10.1109/LCA.2025.3563105","url":null,"abstract":"Low-power volatile FPGAs (VFPGAs) naturally meet the intertwined processing and flexibility demands of IoT devices. However, as IoT devices shift toward Energy Harvesting (EH) for self-sustained operation, VFPGAs are overlooked because they struggle under harvested power. Their volatile SRAM configuration memory cells frequently lose their data, causing high reconfiguration penalties. These penalties grow with FPGAs’ resource usage, limiting it under EH. Still, advances in low-power FPGAs and energy-buffering systems’ efficiency motivate us to explore EH-powered FPGAs. Thus, we analyze the interplay of their resources, performance, and reconfiguration; simulate their operation under different EH conditions; and show how they can be utilized up to an application- and EH-dependent threshold.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"137-140"},"PeriodicalIF":1.4,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shunchen Shi;Fan Yang;Zhichun Li;Xueqi Li;Ninghui Sun
{"title":"Exploring the DIMM PIM Architecture for Accelerating Time Series Analysis","authors":"Shunchen Shi;Fan Yang;Zhichun Li;Xueqi Li;Ninghui Sun","doi":"10.1109/LCA.2025.3562431","DOIUrl":"https://doi.org/10.1109/LCA.2025.3562431","url":null,"abstract":"Time series analysis (TSA) is an important technique for extracting information from domain data. TSA is memory-bound on conventional platforms due to excessive off-chip data movements between processing units and the main memory of the system. Processing in memory (PIM) is a paradigm that alleviates the bottleneck of memory access for data-intensive applications by enabling computation to be performed directly within memory. In this paper, we first perform profiling to characterize TSA on conventional CPUs. Then, we implement TSA on real-world commercial DRAM Dual-Inline Memory Module (DIMM) PIM hardware UPMEM and identify computation as the primary bottleneck on PIM. Finally, we evaluate the impact of enhancing the computational capability of current DIMM PIM hardware on accelerating TSA. Overall, our work provides insights for designing the optimized DIMM PIM architecture for high-performance and efficient time series analysis.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"169-172"},"PeriodicalIF":1.4,"publicationDate":"2025-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144196840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Segin: Synergistically Enabling Fine-Grained Multi-Tenant and Resource Optimized SpMV","authors":"Helya Hosseini;Ubaid Bakhtiar;Donghyeon Joo;Bahar Asgari","doi":"10.1109/LCA.2025.3562120","DOIUrl":"https://doi.org/10.1109/LCA.2025.3562120","url":null,"abstract":"Sparse matrix-vector multiplication (SpMV) is a critical operation across numerous application domains. As a memory-bound kernel, SpMV does not require a complex compute engine but still needs efficient use of available compute units to achieve peak performance efficiently. However, sparsity causes resource underutilization. To efficiently run SpMV, we propose Segin that leverages a novel <italic>fine-grained multi-tenancy</i>, allowing multiple SpMV operations to be executed simultaneously on a single hardware with minimal modifications, which in turn improves throughput. To achieve this, Segin employs hierarchical bitmaps, hence a lightweight logical circuit, to quickly and efficiently identify optimal pairs of sparse matrices to overlap. Our evaluations demonstrate that Segin can improve throughput by 1.92×, while enhancing resource utilization.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"181-184"},"PeriodicalIF":1.4,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144196729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MixDiT: Accelerating Image Diffusion Transformer Inference With Mixed-Precision MX Quantization","authors":"Daeun Kim;Jinwoo Hwang;Changhun Oh;Jongse Park","doi":"10.1109/LCA.2025.3560786","DOIUrl":"https://doi.org/10.1109/LCA.2025.3560786","url":null,"abstract":"<underline>Di</u>ffusion <underline>T</u>ransformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiTquantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produce mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiTaccelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiTdelivers a speedup of 2.10–5.32× over RTX 3090, with no loss in FID.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"141-144"},"PeriodicalIF":1.4,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"L-DTC: Load-based Dynamic Throughput Control for Guaranteed I/O Performance in Virtualized Environments","authors":"TaeHoon Kim;Jaechun No","doi":"10.1109/LCA.2025.3559454","DOIUrl":"https://doi.org/10.1109/LCA.2025.3559454","url":null,"abstract":"In this letter, we identified an issue where the I/O performance of specific tasks could not be guaranteed during multi-process I/O operations, despite the use of the latest storage technologies in virtualized environments. To address this issue, we propose <italic>L-DTC</i>, a novel Load-based Dynamic Throughput Control technique, designed to achieve the guaranteed I/O performance in virtualized environments. Operating at the hypervisor level, <italic>L-DTC</i> provides fine-grained throughput control based on I/O queues and ensures the independent I/O performance for each process by allowing users to define maximum and minimum throughput levels for each queue. We conducted an evaluation of <italic>L-DTC</i> and confirmed that it successfully guarantees the I/O performance requirements of specific processes in multi-process environments. Furthermore, <italic>L-DTC</i> achieved more stable I/O performance compared to existing methods, with improvements in I/O performance of up to 2.1 times, regardless of the I/O scheduler.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"145-148"},"PeriodicalIF":1.4,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144139973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liang Yan;Xiaoyang Lu;Xiaoming Chen;Yinhe Han;Xian-He Sun
{"title":"Pyramid: Accelerating LLM Inference With Cross-Level Processing-in-Memory","authors":"Liang Yan;Xiaoyang Lu;Xiaoming Chen;Yinhe Han;Xian-He Sun","doi":"10.1109/LCA.2025.3559738","DOIUrl":"https://doi.org/10.1109/LCA.2025.3559738","url":null,"abstract":"Integrating processing-in-memory (PIM) with GPUs accelerates large language model (LLM) inference, but existing GPU-PIM systems encounter several challenges. While GPUs excel in large general matrix-matrix multiplications (GEMM), they struggle with small-scale operations better suited for PIM, which currently cannot handle them independently. Additionally, the computational demands of activation operations exceed the capabilities of current PIM technologies, leading to excessive data movement between the GPU and memory. PIM's potential for general matrix-vector multiplications (GEMV) is also limited by insufficient support for fine-grained parallelism. To address these issues, we propose Pyramid, a novel GPU-PIM system that optimizes PIM for LLM inference by strategically allocating cross-level computational resources within PIM to meet diverse needs and leveraging the strengths of both technologies. Evaluation results demonstrate that Pyramid outperforms existing systems like NeuPIM, AiM, and AttAcc by factors of 2.31×, <inline-formula><tex-math>$1.91times$</tex-math></inline-formula>, and <inline-formula><tex-math>$1.72times$</tex-math></inline-formula>, respectively.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"121-124"},"PeriodicalIF":1.4,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-Centric MCM-GPU Architecture","authors":"Hossein SeyyedAghaei;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout","doi":"10.1109/LCA.2025.3553766","DOIUrl":"https://doi.org/10.1109/LCA.2025.3553766","url":null,"abstract":"The demand for powerful GPUs continues to grow, driven by modern-day applications that require ever increasing computational power and memory bandwidth. Multi-Chip Module (MCM) GPUs provide the scalability potential by integrating GPU chiplets on an interposer substrate, however, they are hindered by their GPU-centric design, i.e., off-chip GPU bandwidth is statically (at design time) allocated to local versus remote memory accesses. This paper presents the memory-centric MCM-GPU architecture. By connecting the HBM stacks on the interposer, rather than the GPUs, and by connecting the GPUs to bridges on the interposer network, the full off-chip GPU bandwidth can be dynamically allocated to local and remote memory accesses. Preliminary results demonstrate the potential of the memory-centric architecture offering an average 1.36× (and up to 1.90×) performance improvement over a GPU-centric architecture.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"101-104"},"PeriodicalIF":1.4,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143817792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing and Exploiting Memory Hierarchy Parallelism With MLP Stacks","authors":"Adnan Hasnat;Wim Heirman;Shoaib Akram","doi":"10.1109/LCA.2025.3558808","DOIUrl":"https://doi.org/10.1109/LCA.2025.3558808","url":null,"abstract":"Obtaining high instruction throughput on modern CPUs requires generating a high degree of memory-level parallelism (MLP). MLP is typically reported as a quantitative metric at the DRAM level. However, understanding the reasons that hinder memory parallelism requires more insightful metrics and visualizations. This paper proposes a new taxonomy of MLP metrics, splitting MLP into core and prefetch components and measuring both miss and hit cache level parallelism. Our key contribution is an MLP stack, a visualization that integrates these metrics, and connects then to performance by showing the CPI contribution of each memory level. The stack also shows speculative parallelism from dependency-bound and structural-hazard-bound loads. We implement the MLP stack in a processor simulator and conduct case studies that demonstrate the potential for targeting software optimizations (e.g., software prefetching), and hardware improvements (e.g., instruction window sizing).","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 1","pages":"125-128"},"PeriodicalIF":1.4,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}