{"title":"ADCIM: scalable construction of approximate digital compute-in-memory MACRO for energy-efficient attention computation","authors":"Xu Zhang , Yuan Cheng , Dingyang Zou , Ke Gu , Meiqi Wang , Zhongfeng Wang","doi":"10.1016/j.sysarc.2025.103512","DOIUrl":"10.1016/j.sysarc.2025.103512","url":null,"abstract":"<div><div>Digital compute-in-memory (DCIM) performs energy-efficient computation without accuracy loss, which has been proven to be a promising way to break the memory wall commonly existing in Transformer accelerators with von Neumann architecture. Approximate computing is also widely utilized to boost computation efficiency by exploiting error tolerance in neural networks. In this paper, we perform algorithm-hardware co-optimization to incorporate approximate multiplication into the original full-precision DCIM, resulting in a more energy-efficient computing paradigm. First, a coarse-grained error compensation method is proposed to balance the error of partial product generation and partial product reduction, achieving almost zero mean error during multiplication operations. Second, a fine-grained error compensation method is developed for accumulation operations, further suppressing the error of multiply-and-accumulate by 2-3 orders of magnitude. Additionally, based on the proposed approximate algorithm design, the structure of the Static Random Access Memory (SRAM) cell is fully exploited to implement efficient approximate digital compute-in-memory (ADCIM), which can be scaled to different bit-widths. Finally, a value-adaptive error controller is utilized to match the error tolerance of the self-attention mechanism and enhance computation efficiency. 
The proposed ADCIM has been verified on Transformer models with different quantization precisions, and obtains peak energy efficiency of 14.91 tera-operations per second per watt (TOPS/W) @ 16-bit, 22.84 TOPS/W @ 12-bit, and 39.89 TOPS/W @ 8-bit, with negligible accuracy loss.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103512"},"PeriodicalIF":3.7,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
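The coarse-grained compensation idea in the abstract above can be illustrated with a minimal sketch: drop the low result bits (standing in for discarded partial-product columns) and add a constant equal to the expected value of what was dropped, so the mean error is pulled toward zero. This is our own simplification for illustration; the function name, the truncation scheme, and the parameter are not the paper's design.

```python
def approx_mul(a: int, b: int, trunc_bits: int = 4) -> int:
    """Approximate a*b by zeroing the lowest `trunc_bits` bits of the
    result (a stand-in for dropped partial-product columns) and adding a
    fixed compensation constant equal to the mean of the dropped bits."""
    truncated = ((a * b) >> trunc_bits) << trunc_bits
    compensation = 1 << (trunc_bits - 1)  # ~expected value of dropped bits
    return truncated + compensation
```

Averaged over uniform inputs, the compensated version has a mean error close to zero, while plain truncation is biased low, which is the effect the coarse-grained compensation targets.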
{"title":"Taking a closer look at memory interference effects in commercial-off-the-shelf multicore SoCs","authors":"Lorenzo Carletti , Andrea Serafini , Gianluca Brilli , Alessandro Capotondi , Alessandro Biasci , Paolo Valente , Andrea Marongiu","doi":"10.1016/j.sysarc.2025.103487","DOIUrl":"10.1016/j.sysarc.2025.103487","url":null,"abstract":"<div><div>Commercial-off-the-shelf (COTS) multicore systems on chip (SoC) represent a cheap and convenient solution for deploying sophisticated workloads in various application domains. The combination of several CPU cores and dedicated acceleration units tightly sharing memory and interconnect systems can provide tremendous peak performance, but also threatens timing predictability due to memory interference. Even when focusing on main CPU cores only, it has been reported that task slowdown due to memory interference can surpass 10<span><math><mo>×</mo></math></span>. Such poorly predictable timing behaviors bar greater adoption of COTS multicore SoCs in the domain of timing-critical applications, and have motivated extensive research into solutions that mitigate the problem. Understanding worst-case interference patterns on such hardware platforms is fundamental for building any effective memory interference control mechanism. A common assumption in the literature is that worst-case interference is generated by (and therefore assessed through) read-intensive synthetic workloads with 100% cache miss rate. Yet certain real-life workloads exhibit worse slowdown than this assumed worst case produces, so we study the interference effects of both synthetic and real-life benchmarks on different multicore SoCs. 
Our experiments indicate that cache thrashing causes the worst interference experienced by real-life benchmarks – due to their different usage of caches – and that there is no universal worst-case workload for every platform.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103487"},"PeriodicalIF":3.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144580482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
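The "read-intensive synthetic workload with 100% cache miss rate" that the abstract above questions is classically built as a random pointer chase. A sketch of that access pattern follows; a real aggressor benchmark would be written in C and pinned to a core, but the structure is the same. Names and sizes here are ours.

```python
import random

def make_chase(n: int, seed: int = 0) -> list[int]:
    """Link n array slots into one random cycle; traversing it touches
    memory in an order hardware prefetchers cannot predict, which is how
    the classic near-100%-miss synthetic aggressor is constructed."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chase = [0] * n
    for i in range(n):
        chase[order[i]] = order[(i + 1) % n]
    return chase

def walk(chase: list[int], steps: int, start: int = 0) -> int:
    """Chase pointers for `steps` hops; returns the final slot index."""
    p = start
    for _ in range(steps):
        p = chase[p]
    return p
```

Because the permutation forms a single cycle, a walk of length n visits every slot exactly once, giving uniform pressure on the whole buffer.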
{"title":"BoostTM: Best-effort performance guarantees in best-effort hardware transactional memory for distributed manycore architectures","authors":"Li Wan, Zhiyuan Zhang, Chao Fu, Qiang Li, Jun Han","doi":"10.1016/j.sysarc.2025.103481","DOIUrl":"10.1016/j.sysarc.2025.103481","url":null,"abstract":"<div><div>Concurrent access to shared data in multithreaded programming remains a performance bottleneck in Chip-Multiprocessor (CMP) systems. Best-effort Hardware Transactional Memory (HTM) offers a potential solution but faces critical constraints: frequent livelocks due to the requester-wins conflict strategy, inability to coexist with non-speculative fallback paths, and vulnerability to non-conflict-induced abort events, such as cache overflows and core exceptions. Mainstream CMP platforms, which typically feature out-of-order cores and distributed last-level caches (LLCs), introduce additional challenges for HTM optimization. This paper first formalizes these constraints and provides a theoretical performance analysis of our previous work, LockillerTM, highlighting its inherent advantages. We then introduce BoostTM, an enhanced version of LockillerTM designed for mainstream CMP systems. BoostTM incorporates design improvements to address the identified challenges and introduces a core exception handling mechanism to fill the gap left by LockillerTM in alleviating non-conflict-induced abort events. Finally, we extend the gem5 infrastructure to validate and evaluate BoostTM on a newly configured experimental platform with 32 out-of-order cores and distributed LLCs. 
Our evaluation demonstrates that BoostTM outperforms best-effort HTM, LockillerTM, and recent works—LosaTM-SAFU and CIT—with minimal overhead, providing a comprehensive understanding of the effectiveness and adaptability of each mechanism.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103481"},"PeriodicalIF":3.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144580473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
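The best-effort-plus-fallback pattern that the BoostTM abstract builds on can be modeled in a few lines: retry the speculative path a bounded number of times, then fall back to a global lock. This is a didactic toy of the generic pattern, not BoostTM's mechanism; the class and its parameters are our own.

```python
import threading

class ToyBestEffortTM:
    """Retry a transaction speculatively up to `max_retries` times; after
    repeated aborts, take a global fallback lock and run irrevocably.
    A toy model of the best-effort HTM + fallback-path pattern."""

    def __init__(self, max_retries: int = 3):
        self._fallback = threading.Lock()
        self.max_retries = max_retries

    def run(self, txn, aborts):
        """`aborts(attempt)` models conflict-, overflow-, or
        exception-induced aborts on a given speculative attempt."""
        for attempt in range(self.max_retries):
            if not aborts(attempt):          # speculative attempt commits
                return txn(), "speculative"
        with self._fallback:                 # non-speculative fallback path
            return txn(), "fallback"
```

The livelock and coexistence problems the abstract lists arise precisely because real hardware cannot simply serialize on such a lock without aborting concurrent transactions.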
{"title":"Minimizing energy consumption of collaborative deployment and task offloading in two-tier UAV edge computing networks","authors":"Yixuan Fang , Zhufang Kuang , Haobin Wang , Siyu Lin , Anfeng Liu","doi":"10.1016/j.sysarc.2025.103511","DOIUrl":"10.1016/j.sysarc.2025.103511","url":null,"abstract":"<div><div>Multi-Unmanned Aerial Vehicle (UAV)-supported Mobile Edge Computing (MEC) can meet the computational requirements of tasks with high complexity and latency sensitivity to compensate for the lack of computational resources and coverage. In this paper, a multi-user, multi-UAV MEC network is built as a two-tier UAV system in a task-intensive region where base stations are insufficient, with a centralized top-center UAV and a set of distributed bottom-UAVs providing computing services. The total energy consumption of the system is minimized by jointly optimizing the task offloading decision, 3D deployment of two-tier UAVs, the elevation angle of the bottom UAV, the number of UAVs, and computational resource allocation. To this end, an algorithm combining Differential Evolution and a greedy algorithm with the objective of minimizing Energy Consumption (DEEC) is proposed. The algorithm uses a two-tier optimization framework: the upper tier uses a population-based optimization algorithm to solve for the location and elevation angle of each bottom UAV and the number of UAVs based on the actual ground equipment, while the lower tier uses clustering and greedy algorithms to solve for the position of the top UAV, the task offloading decisions, and the allocation of computational resources based on the upper-tier results. 
The simulation results show that the algorithm effectively reduces the total energy consumption of the system while satisfying the task computation success rate and delay requirements.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103511"},"PeriodicalIF":3.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
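The upper tier of the DEEC framework described above is a population-based optimizer; a minimal DE/rand/1/bin loop of the kind it could use is sketched below. The cost function here stands in for the lower-tier clustering/greedy evaluation; all names, parameters, and defaults are ours, not the paper's.

```python
import random

def de_minimize(cost, bounds, pop_size=20, iters=100, F=0.5, CR=0.9, seed=1):
    """Minimal DE/rand/1/bin. `cost` maps a candidate (list of floats) to
    a scalar; in a DEEC-like setup it would embed the lower-tier greedy
    offloading and resource-allocation step. Returns (best, best_cost)."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, d: min(max(v, bounds[d][0]), bounds[d][1])
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jrand = rng.randrange(dim)  # ensure at least one mutated gene
            trial = [clip(pop[a][d] + F * (pop[b][d] - pop[c][d]), d)
                     if (rng.random() < CR or d == jrand) else pop[i][d]
                     for d in range(dim)]
            s = cost(trial)
            if s <= scores[i]:          # greedy one-to-one replacement
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

On a smooth toy objective the loop converges quickly; the bounds model the deployment region and elevation-angle limits of the placement problem.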
{"title":"A transformation strategy for process partitioning in hierarchical concurrent process networks","authors":"Fahimeh Bahrami, Ingo Sander","doi":"10.1016/j.sysarc.2025.103509","DOIUrl":"10.1016/j.sysarc.2025.103509","url":null,"abstract":"<div><div>Concurrent process networks are a widely used parallel programming model for designing multiprocessor embedded systems, where system functionality is decomposed into processes that communicate via signals. These processes can be mapped onto different processing elements and executed concurrently. While the initial process network is designed to effectively capture high-level parallelism, it may not fully exploit the available parallelism. To enhance concurrency and balance workload distribution, process partitioning transformations are applied, restructuring process networks to expose finer-grained parallelism. The effectiveness of these transformations, however, depends on how well they align with the underlying hardware’s parallel capabilities.</div><div>A variety of partitioning transformations have been introduced for process networks constructed using <em>higher-order functions</em> in the form of <em>process constructors</em> and <em>data-parallel skeletons</em>. For such networks, algebraic laws of functions provide a principled foundation for defining transformation rules, enabling a systematic and non-ad-hoc approach to process network modification. However, selecting the most suitable transformation to optimize key performance metrics remains an open challenge. To address this, we propose a <em>transformation strategy</em> that systematically identifies the most effective partitioning transformations. Our approach introduces evaluation metrics and analytical models to assess the impact of parametric transformations across different configurations. 
We validate the proposed strategy through the transformation of two image processing algorithms, demonstrating that our analytical models correctly predict the most suitable transformations for enhancing parallelism and performance.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103509"},"PeriodicalIF":3.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
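The algebraic laws mentioned in the abstract above are of the shape `map f == concat . map (map f) . split_k`: partitioning a process and mapping over the pieces must preserve the original semantics. A minimal Python rendering of that law (our illustration, not the authors' notation) is:

```python
def split_k(xs, k):
    """Partition xs into k contiguous, near-equal chunks (a 'scatter')."""
    cuts = [round(i * len(xs) / k) for i in range(k + 1)]
    return [xs[cuts[i]:cuts[i + 1]] for i in range(k)]

def partitioned_map(f, xs, k):
    """map f == concat . map (map f) . split_k: the kind of algebraic law
    that makes a process-partitioning transformation semantics-preserving
    while exposing k-way parallelism."""
    return [y for chunk in split_k(xs, k) for y in map(f, chunk)]
```

Because the law holds for every k, the choice of k becomes a pure performance decision, which is exactly what the proposed transformation strategy and its analytical models are selecting.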
{"title":"DLLPM: Dual-layer location privacy matching in V2V energy trading","authors":"Saad Masood , Muneeb Ul Hassan , Pei-Wei Tsai , Jinjun Chen","doi":"10.1016/j.sysarc.2025.103507","DOIUrl":"10.1016/j.sysarc.2025.103507","url":null,"abstract":"<div><div>The recent increase in Electric Vehicles (EVs) on the road has highlighted privacy concerns, particularly in the Vehicle-to-Vehicle (V2V) energy trading scenario. Ensuring location privacy in Vehicular Ad Hoc Networks (VANETs) is crucial for user confidentiality. Existing privacy techniques in the V2V paradigm protect the location coordinates of the EVs, but privacy risks persist after EVs are matched. In this paper, we introduce a dual-layer location privacy matching (DLLPM) technique to enhance the privacy of V2V matching. Our approach utilizes Laplace differential privacy and partial homomorphic encryption, ensuring that the EV’s private data remains inaccessible to both participants and adversaries. We introduce a noise addition and clipping algorithm to obfuscate EV coordinates within a defined radius. Encrypted distance-based preference lists are generated using partial homomorphic encryption to establish differentially private stable matches. DLLPM ensures EV location privacy throughout the matching process and mitigates the risk of location privacy leakage even after suppliers and demanders exchange location information. 
Theoretical analysis and experimental results confirm the efficiency of DLLPM, demonstrating robust privacy preservation with a computational complexity of <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>log</mo><mi>n</mi><mi>⋅</mi><mrow><mo>(</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>enc</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>addHE</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>subHE</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>dec</mtext></mrow></msub><mo>)</mo></mrow><mo>)</mo></mrow></mrow></math></span>. We further evaluate computational performance using 128-bit and 256-bit encryption, showing that DLLPM achieves private and efficient matching in the V2V trading paradigm.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103507"},"PeriodicalIF":3.7,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
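The noise-addition-and-clipping step that DLLPM applies to EV coordinates can be sketched as follows: sample planar Laplace-style noise with scale tied to the privacy budget, then clip the perturbed report back inside the allowed radius. This is our simplified illustration of the generic mechanism, not the authors' algorithm; note that clipping biases the noise, so a real ε-DP accounting must treat it carefully.

```python
import math
import random

def laplace(scale: float, rng: random.Random) -> float:
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def obfuscate(x, y, epsilon, radius, rng):
    """Perturb (x, y) with per-axis Laplace noise of scale radius/epsilon,
    then clip the report onto the disc of `radius` around the true point,
    a sketch of the noise-addition-and-clipping step described above."""
    s = radius / epsilon
    dx, dy = laplace(s, rng), laplace(s, rng)
    d = math.hypot(dx, dy)
    if d > radius:                      # clip to the allowed region
        dx, dy = dx * radius / d, dy * radius / d
    return x + dx, y + dy
```

Distances between obfuscated points can then feed the encrypted preference lists, so matching proceeds without any party seeing true coordinates.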
{"title":"Data-driven power modeling and monitoring via hardware performance counter tracking","authors":"Sergio Mazzola , Gabriele Ara , Thomas Benz , Björn Forsberg , Tommaso Cucinotta , Luca Benini","doi":"10.1016/j.sysarc.2025.103504","DOIUrl":"10.1016/j.sysarc.2025.103504","url":null,"abstract":"<div><div>Energy-centric design is paramount in the current embedded computing era: use cases require increasingly high performance at an affordable power budget, often under real-time constraints. Hardware heterogeneity and parallelism help address the efficiency challenge, but greatly complicate online power consumption assessments, which are essential for dynamic hardware and software stack adaptations. We introduce a novel power modeling methodology with state-of-the-art accuracy, low overhead, and high responsiveness, whose implementation does not rely on microarchitectural details. Our methodology identifies the Performance Monitoring Counters (PMCs) with the highest linear correlation to the power consumption of each hardware sub-system, for each Dynamic Voltage and Frequency Scaling (DVFS) state. The individual, simple models are composed into a complete model that effectively describes the power consumption of the whole system, achieving high accuracy and low overhead. Our evaluation reports an average estimation error of 7.5% for power consumption and 1.3% for energy. We integrate these models in the Linux kernel with Runmeter, an open-source, PMC-based monitoring framework. Runmeter manages PMC sampling and processing, enabling the execution of our power models at runtime. With a worst-case time overhead of only 0.7%, Runmeter provides responsive and accurate power measurements directly in the kernel. 
This information can be employed for actuation policies in workload-aware DVFS and power-aware, closed-loop task scheduling.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103504"},"PeriodicalIF":3.7,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
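The core of the methodology above, selecting the PMC with the highest linear correlation to measured power and fitting a simple linear model per DVFS state, can be sketched in a few lines. Counter names and sample values below are invented for illustration; the paper's models are per-subsystem and per-DVFS-state.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def select_and_fit(counters, power):
    """Pick the PMC most linearly correlated with measured power and fit
    power ~= a * pmc + b by least squares. Returns (name, a, b)."""
    name, xs = max(counters.items(),
                   key=lambda kv: abs(pearson(kv[1], power)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(power) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, power))
         / sum((x - mx) ** 2 for x in xs))
    return name, a, my - a * mx
```

Composing one such simple model per sub-system and DVFS state is what keeps the runtime evaluation cheap enough for in-kernel use.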
{"title":"Temperature and deadline aware runtime resource management with workload prediction for heterogeneous multi-core platforms","authors":"Mina Niknafs, Petru Eles, Zebo Peng","doi":"10.1016/j.sysarc.2025.103506","DOIUrl":"10.1016/j.sysarc.2025.103506","url":null,"abstract":"<div><div>Contemporary embedded platforms necessitate advanced resource management techniques to effectively utilize their diverse computational resources. These platforms typically encounter workload fluctuations, so workload prediction has the potential to enhance resource management efficiency. In addition, in modern multi-core systems, processing cores tend to decrease in size without a proportional decrease in power consumption. This reduction in core size contributes to higher power density within the chips, leading to elevated chip temperatures. Therefore, addressing the temperature issue becomes critical. This paper introduces a prediction-based and temperature-aware resource management heuristic designed to meet task deadlines, while simultaneously considering energy minimization. When evaluated on real-life workload traces, the proposed method achieves a 5.5% increase in acceptance rate with one-step-ahead prediction and an 8.9% increase with four-steps-ahead prediction in a temperature-aware context, compared to scenarios without prediction.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103506"},"PeriodicalIF":3.7,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144523542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
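A temperature-and-deadline-aware placement decision of the flavor described in the abstract above can be sketched as a greedy rule: among the cores that can still meet the task's deadline, run it on the coolest one, and reject it otherwise. This is our simplification for illustration; the field names, the core model, and the policy details are not the paper's heuristic.

```python
def place_task(wcet, deadline, now, cores):
    """Greedy sketch: among cores whose current backlog still lets the
    task finish by `deadline`, pick the coolest; return None (reject)
    if no core is feasible. Each core is a dict with id, speed (relative
    throughput), temp (current temperature), and free_at (queue drain time)."""
    feasible = [c for c in cores
                if max(c["free_at"], now) + wcet / c["speed"] <= deadline]
    if not feasible:
        return None                      # task rejected: deadline miss
    best = min(feasible, key=lambda c: c["temp"])
    best["free_at"] = max(best["free_at"], now) + wcet / best["speed"]
    return best["id"]
```

Workload prediction improves such a rule by letting it reserve capacity for tasks expected to arrive soon instead of committing cores myopically.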
{"title":"Efficient one-to-one sharing: Public key matchmaking encryption","authors":"Yunhao Ling , Guang Zhang , Jie Chen , Haifeng Qian","doi":"10.1016/j.sysarc.2025.103492","DOIUrl":"10.1016/j.sysarc.2025.103492","url":null,"abstract":"<div><div>Identity-Based Matchmaking Encryption (IB-ME) enables both the sender and the receiver to respectively specify an identity that the other party must satisfy, in order to reveal the messages. IB-ME is essentially a one-to-one matchmaking encryption scheme, and has many applications such as secure data sharing and non-interactive secret handshake protocols. However, the system requires a central authority to generate encryption keys and decryption keys for all users, which can lead to the key escrow problem, a single point of failure, and a performance bottleneck. The goal of this paper is to remove any authority from the system. We propose a matchmaking encryption in the public-key setting, named Public Key Matchmaking Encryption (PK-ME). We give the formal syntax and security definition of PK-ME, present a lightweight PK-ME scheme, and formally prove its security in the random oracle model. Finally, we conduct experiments to show the practicability of the scheme. In particular, compared to the related ME schemes, our encryption and decryption are very efficient, and our PK-ME scheme has shorter parameters.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103492"},"PeriodicalIF":3.7,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey on versatile embedded Machine Learning hardware acceleration","authors":"Pierre Garreau , Pascal Cotret , Julien Francq , Jean-Christophe Cexus , Loïc Lagadec","doi":"10.1016/j.sysarc.2025.103501","DOIUrl":"10.1016/j.sysarc.2025.103501","url":null,"abstract":"<div><div>This survey investigates recent developments in versatile embedded Machine Learning (ML) hardware acceleration. Various architectural approaches for efficient implementation of ML algorithms on resource-constrained devices are analyzed, focusing on three key aspects: performance optimization, embedded system considerations (throughput, latency, energy efficiency) and multi-application support. The survey does not, however, cover attacks on and defenses of the ML architectures themselves. The survey then explores different hardware acceleration strategies, from custom RISC-V instructions to specialized Processing Elements (PEs), Processing-in-Memory (PiM) architectures and co-design approaches. Notable innovations include flexible bit-precision support, reconfigurable PEs, and memory management techniques that reduce the movement overhead of weights and (hyper-)parameters. Subsequently, these architectures are evaluated based on the aforementioned key aspects. 
Our analysis shows that effective and robust embedded ML acceleration requires careful consideration of the trade-offs between computational capability, power consumption, and architecture flexibility, depending on the application.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103501"},"PeriodicalIF":3.7,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144470141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}