{"title":"ADCIM: scalable construction of approximate digital compute-in-memory MACRO for energy-efficient attention computation","authors":"Xu Zhang , Yuan Cheng , Dingyang Zou , Ke Gu , Meiqi Wang , Zhongfeng Wang","doi":"10.1016/j.sysarc.2025.103512","DOIUrl":"10.1016/j.sysarc.2025.103512","url":null,"abstract":"<div><div>Digital compute-in-memory (DCIM) performs energy-efficient computation without accuracy loss, which has been proven to be a promising way to break the memory wall commonly existing in Transformer accelerators with von Neumann architecture. Approximate computing is also widely utilized to boost computation efficiency by exploiting error tolerance in neural networks. In this paper, we perform algorithm-hardware co-optimization to incorporate approximate multiplication into the original full-precision DCIM, resulting in a more energy-efficient computing paradigm. First, a coarse-grained error compensation method is proposed to balance the error of partial product generation and partial product reduction, achieving almost zero mean error during multiplication operations. Second, a fine-grained error compensation method is developed for accumulation operations, further suppressing the error of multiply-and-accumulate by 2-3 orders of magnitude. Additionally, based on the proposed approximate algorithm design, the structure of the Static Random Access Memory (SRAM) cell is fully exploited to implement efficient approximate digital compute-in-memory (ADCIM), which can be scaled to different bit-widths. Finally, a value-adaptive error controller is utilized to match the error tolerance of the self-attention mechanism and enhance computation efficiency. 
The proposed ADCIM has been verified on Transformer models with different quantization precisions, and obtains peak energy efficiency of 14.91 tera-operations per second per watt (TOPS/W) @ 16-bit, 22.84 TOPS/W @ 12-bit, and 39.89 TOPS/W @ 8-bit, with negligible accuracy loss.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103512"},"PeriodicalIF":3.7,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
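The coarse-grained compensation idea in the abstract above can be illustrated with a minimal sketch: drop the low result bits (standing in for discarded partial-product columns) and add a constant equal to the expected value of what was dropped, so the mean error is pulled toward zero. This is our own simplification for illustration; the function name, the truncation scheme, and the parameter are not the paper's design.

```python
def approx_mul(a: int, b: int, trunc_bits: int = 4) -> int:
    """Approximate a*b by zeroing the lowest `trunc_bits` bits of the
    result (a stand-in for dropped partial-product columns) and adding a
    fixed compensation constant equal to the mean of the dropped bits."""
    truncated = ((a * b) >> trunc_bits) << trunc_bits
    compensation = 1 << (trunc_bits - 1)  # ~expected value of dropped bits
    return truncated + compensation
```

Averaged over uniform inputs, the compensated version has a mean error close to zero, while plain truncation is biased low, which is the effect the coarse-grained compensation targets.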
{"title":"Taking a closer look at memory interference effects in commercial-off-the-shelf multicore SoCs","authors":"Lorenzo Carletti , Andrea Serafini , Gianluca Brilli , Alessandro Capotondi , Alessandro Biasci , Paolo Valente , Andrea Marongiu","doi":"10.1016/j.sysarc.2025.103487","DOIUrl":"10.1016/j.sysarc.2025.103487","url":null,"abstract":"<div><div>Commercial-off-the-shelf (COTS) multicore systems on chip (SoC) represent a cheap and convenient solution for deploying sophisticated workloads in various application domains. The combination of several CPU cores and dedicated acceleration units tightly sharing memory and interconnect systems can provide tremendous peak performance, but also threatens timing predictability due to memory interference. Even when focusing on main CPU cores only, it has been reported that task slowdown due to memory interference can surpass 10<span><math><mo>×</mo></math></span>. Such poorly predictable timing behaviors bar greater adoption of COTS multicore SoCs in the domain of timing-critical applications, and have motivated extensive research into solutions that mitigate the problem. Understanding worst-case interference patterns on such hardware platforms is fundamental for building any effective memory interference control mechanism. A common assumption in the literature is that worst-case interference is generated by (and therefore assessed through) read-intensive synthetic workloads with 100% cache miss rate. Yet certain real-life workloads exhibit worse slowdown than this assumed worst case produces, so we study the interference effects of both synthetic and real-life benchmarks on different multicore SoCs. 
Our experiments indicate that cache thrashing causes the worst interference experienced by real-life benchmarks – due to their different usage of caches – and that there is no universal worst-case workload for every platform.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103487"},"PeriodicalIF":3.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144580482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
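The "read-intensive synthetic workload with 100% cache miss rate" that the abstract above questions is classically built as a random pointer chase. A sketch of that access pattern follows; a real aggressor benchmark would be written in C and pinned to a core, but the structure is the same. Names and sizes here are ours.

```python
import random

def make_chase(n: int, seed: int = 0) -> list[int]:
    """Link n array slots into one random cycle; traversing it touches
    memory in an order hardware prefetchers cannot predict, which is how
    the classic near-100%-miss synthetic aggressor is constructed."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chase = [0] * n
    for i in range(n):
        chase[order[i]] = order[(i + 1) % n]
    return chase

def walk(chase: list[int], steps: int, start: int = 0) -> int:
    """Chase pointers for `steps` hops; returns the final slot index."""
    p = start
    for _ in range(steps):
        p = chase[p]
    return p
```

Because the permutation forms a single cycle, a walk of length n visits every slot exactly once, giving uniform pressure on the whole buffer.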
{"title":"BoostTM: Best-effort performance guarantees in best-effort hardware transactional memory for distributed manycore architectures","authors":"Li Wan, Zhiyuan Zhang, Chao Fu, Qiang Li, Jun Han","doi":"10.1016/j.sysarc.2025.103481","DOIUrl":"10.1016/j.sysarc.2025.103481","url":null,"abstract":"<div><div>Concurrent access to shared data in multithreaded programming remains a performance bottleneck in Chip-Multiprocessor (CMP) systems. Best-effort Hardware Transactional Memory (HTM) offers a potential solution but faces critical constraints: frequent livelocks due to the requester-wins conflict strategy, inability to coexist with non-speculative fallback paths, and vulnerability to non-conflict-induced abort events, such as cache overflows and core exceptions. Mainstream CMP platforms, which typically feature out-of-order cores and distributed last-level caches (LLCs), introduce additional challenges for HTM optimization. This paper first formalizes these constraints and provides a theoretical performance analysis of our previous work, LockillerTM, highlighting its inherent advantages. We then introduce BoostTM, an enhanced version of LockillerTM designed for mainstream CMP systems. BoostTM incorporates design improvements to address the identified challenges and introduces a core exception handling mechanism to fill the gap left by LockillerTM in alleviating non-conflict-induced abort events. Finally, we extend the gem5 infrastructure to validate and evaluate BoostTM on a newly configured experimental platform with 32 out-of-order cores and distributed LLCs. 
Our evaluation demonstrates that BoostTM outperforms best-effort HTM, LockillerTM, and recent works—LosaTM-SAFU and CIT—with minimal overhead, providing a comprehensive understanding of the effectiveness and adaptability of each mechanism.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103481"},"PeriodicalIF":3.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144580473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
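The best-effort-plus-fallback pattern that the BoostTM abstract builds on can be modeled in a few lines: retry the speculative path a bounded number of times, then fall back to a global lock. This is a didactic toy of the generic pattern, not BoostTM's mechanism; the class and its parameters are our own.

```python
import threading

class ToyBestEffortTM:
    """Retry a transaction speculatively up to `max_retries` times; after
    repeated aborts, take a global fallback lock and run irrevocably.
    A toy model of the best-effort HTM + fallback-path pattern."""

    def __init__(self, max_retries: int = 3):
        self._fallback = threading.Lock()
        self.max_retries = max_retries

    def run(self, txn, aborts):
        """`aborts(attempt)` models conflict-, overflow-, or
        exception-induced aborts on a given speculative attempt."""
        for attempt in range(self.max_retries):
            if not aborts(attempt):          # speculative attempt commits
                return txn(), "speculative"
        with self._fallback:                 # non-speculative fallback path
            return txn(), "fallback"
```

The livelock and coexistence problems the abstract lists arise precisely because real hardware cannot simply serialize on such a lock without aborting concurrent transactions.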
{"title":"Minimizing energy consumption of collaborative deployment and task offloading in two-tier UAV edge computing networks","authors":"Yixuan Fang , Zhufang Kuang , Haobin Wang , Siyu Lin , Anfeng Liu","doi":"10.1016/j.sysarc.2025.103511","DOIUrl":"10.1016/j.sysarc.2025.103511","url":null,"abstract":"<div><div>Multi-Unmanned Aerial Vehicle (UAV)-supported Mobile Edge Computing (MEC) can meet the computational requirements of tasks with high complexity and latency sensitivity to compensate for the lack of computational resources and coverage. In this paper, a multi-user, multi-UAV MEC network is built as a two-tier UAV system in a task-intensive region where base stations are insufficient, with a centralized top-center UAV and a set of distributed bottom-UAVs providing computing services. The total energy consumption of the system is minimized by jointly optimizing the task offloading decision, 3D deployment of two-tier UAVs, the elevation angle of the bottom UAV, the number of UAVs, and computational resource allocation. To this end, an algorithm combining Differential Evolution and a greedy algorithm with the objective of minimizing Energy Consumption (DEEC) is proposed. The algorithm uses a two-tier optimization framework: the upper tier uses a population-based optimization algorithm to solve for the location and elevation angle of each bottom UAV and the number of UAVs based on the actual ground equipment, while the lower tier uses clustering and greedy algorithms to solve for the position of the top UAV, the task offloading decisions, and the allocation of computational resources based on the upper-tier results. 
The simulation results show that the algorithm effectively reduces the total energy consumption of the system while satisfying the task computation success rate and delay requirements.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103511"},"PeriodicalIF":3.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
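The upper tier of the DEEC framework described above is a population-based optimizer; a minimal DE/rand/1/bin loop of the kind it could use is sketched below. The cost function here stands in for the lower-tier clustering/greedy evaluation; all names, parameters, and defaults are ours, not the paper's.

```python
import random

def de_minimize(cost, bounds, pop_size=20, iters=100, F=0.5, CR=0.9, seed=1):
    """Minimal DE/rand/1/bin. `cost` maps a candidate (list of floats) to
    a scalar; in a DEEC-like setup it would embed the lower-tier greedy
    offloading and resource-allocation step. Returns (best, best_cost)."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, d: min(max(v, bounds[d][0]), bounds[d][1])
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jrand = rng.randrange(dim)  # ensure at least one mutated gene
            trial = [clip(pop[a][d] + F * (pop[b][d] - pop[c][d]), d)
                     if (rng.random() < CR or d == jrand) else pop[i][d]
                     for d in range(dim)]
            s = cost(trial)
            if s <= scores[i]:          # greedy one-to-one replacement
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

On a smooth toy objective the loop converges quickly; the bounds model the deployment region and elevation-angle limits of the placement problem.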
{"title":"A transformation strategy for process partitioning in hierarchical concurrent process networks","authors":"Fahimeh Bahrami, Ingo Sander","doi":"10.1016/j.sysarc.2025.103509","DOIUrl":"10.1016/j.sysarc.2025.103509","url":null,"abstract":"<div><div>Concurrent process networks are a widely used parallel programming model for designing multiprocessor embedded systems, where system functionality is decomposed into processes that communicate via signals. These processes can be mapped onto different processing elements and executed concurrently. While the initial process network is designed to effectively capture high-level parallelism, it may not fully exploit the available parallelism. To enhance concurrency and balance workload distribution, process partitioning transformations are applied, restructuring process networks to expose finer-grained parallelism. The effectiveness of these transformations, however, depends on how well they align with the underlying hardware’s parallel capabilities.</div><div>A variety of partitioning transformations have been introduced for process networks constructed using <em>higher-order functions</em> in the form of <em>process constructors</em> and <em>data-parallel skeletons</em>. For such networks, algebraic laws of functions provide a principled foundation for defining transformation rules, enabling a systematic and non-ad-hoc approach to process network modification. However, selecting the most suitable transformation to optimize key performance metrics remains an open challenge. To address this, we propose a <em>transformation strategy</em> that systematically identifies the most effective partitioning transformations. Our approach introduces evaluation metrics and analytical models to assess the impact of parametric transformations across different configurations. 
We validate the proposed strategy through the transformation of two image processing algorithms, demonstrating that our analytical models correctly predict the most suitable transformations for enhancing parallelism and performance.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103509"},"PeriodicalIF":3.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
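The algebraic laws mentioned in the abstract above are of the shape `map f == concat . map (map f) . split_k`: partitioning a process and mapping over the pieces must preserve the original semantics. A minimal Python rendering of that law (our illustration, not the authors' notation) is:

```python
def split_k(xs, k):
    """Partition xs into k contiguous, near-equal chunks (a 'scatter')."""
    cuts = [round(i * len(xs) / k) for i in range(k + 1)]
    return [xs[cuts[i]:cuts[i + 1]] for i in range(k)]

def partitioned_map(f, xs, k):
    """map f == concat . map (map f) . split_k: the kind of algebraic law
    that makes a process-partitioning transformation semantics-preserving
    while exposing k-way parallelism."""
    return [y for chunk in split_k(xs, k) for y in map(f, chunk)]
```

Because the law holds for every k, the choice of k becomes a pure performance decision, which is exactly what the proposed transformation strategy and its analytical models are selecting.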
{"title":"DLLPM: Dual-layer location privacy matching in V2V energy trading","authors":"Saad Masood , Muneeb Ul Hassan , Pei-Wei Tsai , Jinjun Chen","doi":"10.1016/j.sysarc.2025.103507","DOIUrl":"10.1016/j.sysarc.2025.103507","url":null,"abstract":"<div><div>The recent increase in Electric Vehicles (EVs) on the road has highlighted privacy concerns, particularly in the Vehicle-to-Vehicle (V2V) energy trading scenario. Ensuring location privacy in Vehicular Ad Hoc Networks (VANETs) is crucial for user confidentiality. Existing privacy techniques in the V2V paradigm protect the location coordinates of the EVs, but privacy risks persist after EVs are matched. In this paper, we introduce a dual-layer location privacy matching (DLLPM) technique to enhance the privacy of V2V matching. Our approach utilizes Laplace differential privacy and partial homomorphic encryption, ensuring that the EV’s private data remains inaccessible to both participants and adversaries. We introduce a noise addition and clipping algorithm to obfuscate EV coordinates within a defined radius. Encrypted distance-based preference lists are generated using partial homomorphic encryption to establish differentially private stable matches. DLLPM ensures EV location privacy throughout the matching process and mitigates the risk of location privacy leakage even after suppliers and demanders exchange location information. 
Theoretical analysis and experimental results confirm the efficiency of DLLPM, demonstrating robust privacy preservation with a computational complexity of <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>log</mo><mi>n</mi><mi>⋅</mi><mrow><mo>(</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>enc</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>addHE</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>subHE</mtext></mrow></msub><mo>+</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext>dec</mtext></mrow></msub><mo>)</mo></mrow><mo>)</mo></mrow></mrow></math></span>. We further evaluate computational performance using 128-bit and 256-bit encryption, showing that DLLPM achieves private and efficient matching in the V2V trading paradigm.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103507"},"PeriodicalIF":3.7,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144588014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
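The noise-addition-and-clipping step that DLLPM applies to EV coordinates can be sketched as follows: sample planar Laplace-style noise with scale tied to the privacy budget, then clip the perturbed report back inside the allowed radius. This is our simplified illustration of the generic mechanism, not the authors' algorithm; note that clipping biases the noise, so a real ε-DP accounting must treat it carefully.

```python
import math
import random

def laplace(scale: float, rng: random.Random) -> float:
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def obfuscate(x, y, epsilon, radius, rng):
    """Perturb (x, y) with per-axis Laplace noise of scale radius/epsilon,
    then clip the report onto the disc of `radius` around the true point,
    a sketch of the noise-addition-and-clipping step described above."""
    s = radius / epsilon
    dx, dy = laplace(s, rng), laplace(s, rng)
    d = math.hypot(dx, dy)
    if d > radius:                      # clip to the allowed region
        dx, dy = dx * radius / d, dy * radius / d
    return x + dx, y + dy
```

Distances between obfuscated points can then feed the encrypted preference lists, so matching proceeds without any party seeing true coordinates.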
{"title":"Data-driven power modeling and monitoring via hardware performance counter tracking","authors":"Sergio Mazzola , Gabriele Ara , Thomas Benz , Björn Forsberg , Tommaso Cucinotta , Luca Benini","doi":"10.1016/j.sysarc.2025.103504","DOIUrl":"10.1016/j.sysarc.2025.103504","url":null,"abstract":"<div><div>Energy-centric design is paramount in the current embedded computing era: use cases require increasingly high performance at an affordable power budget, often under real-time constraints. Hardware heterogeneity and parallelism help address the efficiency challenge, but greatly complicate online power consumption assessments, which are essential for dynamic hardware and software stack adaptations. We introduce a novel power modeling methodology with state-of-the-art accuracy, low overhead, and high responsiveness, whose implementation does not rely on microarchitectural details. Our methodology identifies the Performance Monitoring Counters (PMCs) with the highest linear correlation to the power consumption of each hardware sub-system, for each Dynamic Voltage and Frequency Scaling (DVFS) state. The individual, simple models are composed into a complete model that effectively describes the power consumption of the whole system, achieving high accuracy and low overhead. Our evaluation reports an average estimation error of 7.5% for power consumption and 1.3% for energy. We integrate these models in the Linux kernel with Runmeter, an open-source, PMC-based monitoring framework. Runmeter manages PMC sampling and processing, enabling the execution of our power models at runtime. With a worst-case time overhead of only 0.7%, Runmeter provides responsive and accurate power measurements directly in the kernel. 
This information can be employed for actuation policies in workload-aware DVFS and power-aware, closed-loop task scheduling.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103504"},"PeriodicalIF":3.7,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
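The core of the methodology above, selecting the PMC with the highest linear correlation to measured power and fitting a simple linear model per DVFS state, can be sketched in a few lines. Counter names and sample values below are invented for illustration; the paper's models are per-subsystem and per-DVFS-state.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def select_and_fit(counters, power):
    """Pick the PMC most linearly correlated with measured power and fit
    power ~= a * pmc + b by least squares. Returns (name, a, b)."""
    name, xs = max(counters.items(),
                   key=lambda kv: abs(pearson(kv[1], power)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(power) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, power))
         / sum((x - mx) ** 2 for x in xs))
    return name, a, my - a * mx
```

Composing one such simple model per sub-system and DVFS state is what keeps the runtime evaluation cheap enough for in-kernel use.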
{"title":"Temperature and deadline aware runtime resource management with workload prediction for heterogeneous multi-core platforms","authors":"Mina Niknafs, Petru Eles, Zebo Peng","doi":"10.1016/j.sysarc.2025.103506","DOIUrl":"10.1016/j.sysarc.2025.103506","url":null,"abstract":"<div><div>Contemporary embedded platforms necessitate advanced resource management techniques to effectively utilize their diverse computational resources. These platforms typically encounter workload fluctuations, so workload prediction has the potential to enhance resource management efficiency. In addition, in modern multi-core systems, processing cores tend to decrease in size without a proportional decrease in power consumption. This reduction in core size contributes to higher power density within the chips, leading to elevated chip temperatures. Therefore, addressing the temperature issue becomes critical. This paper introduces a prediction-based and temperature-aware resource management heuristic designed to meet task deadlines, while simultaneously considering energy minimization. When evaluated on real-life workload traces, the proposed method achieves a 5.5% increase in acceptance rate with one-step-ahead prediction and an 8.9% increase with four-steps-ahead prediction in a temperature-aware context, compared to scenarios without prediction.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103506"},"PeriodicalIF":3.7,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144523542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
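A temperature-and-deadline-aware placement decision of the flavor described in the abstract above can be sketched as a greedy rule: among the cores that can still meet the task's deadline, run it on the coolest one, and reject it otherwise. This is our simplification for illustration; the field names, the core model, and the policy details are not the paper's heuristic.

```python
def place_task(wcet, deadline, now, cores):
    """Greedy sketch: among cores whose current backlog still lets the
    task finish by `deadline`, pick the coolest; return None (reject)
    if no core is feasible. Each core is a dict with id, speed (relative
    throughput), temp (current temperature), and free_at (queue drain time)."""
    feasible = [c for c in cores
                if max(c["free_at"], now) + wcet / c["speed"] <= deadline]
    if not feasible:
        return None                      # task rejected: deadline miss
    best = min(feasible, key=lambda c: c["temp"])
    best["free_at"] = max(best["free_at"], now) + wcet / best["speed"]
    return best["id"]
```

Workload prediction improves such a rule by letting it reserve capacity for tasks expected to arrive soon instead of committing cores myopically.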
{"title":"Efficient one-to-one sharing: Public key matchmaking encryption","authors":"Yunhao Ling , Guang Zhang , Jie Chen , Haifeng Qian","doi":"10.1016/j.sysarc.2025.103492","DOIUrl":"10.1016/j.sysarc.2025.103492","url":null,"abstract":"<div><div>Identity-Based Matchmaking Encryption (IB-ME) enables both the sender and the receiver to respectively specify an identity that the other party must satisfy, in order to reveal the messages. IB-ME is essentially a one-to-one matchmaking encryption scheme, and has many applications such as secure data sharing and non-interactive secret handshake protocols. However, the system requires a central authority to generate encryption keys and decryption keys for all users, which can lead to the key escrow problem, a single point of failure, and a performance bottleneck. The goal of this paper is to remove any authority from the system. We propose a matchmaking encryption in the public-key setting, named Public Key Matchmaking Encryption (PK-ME). We give the formal syntax and security definition of PK-ME, present a lightweight PK-ME scheme, and formally prove its security in the random oracle model. Finally, we conduct experiments to show the practicability of the scheme. In particular, compared to the related ME schemes, our encryption and decryption are very efficient, and our PK-ME scheme has shorter parameters.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103492"},"PeriodicalIF":3.7,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey on versatile embedded Machine Learning hardware acceleration","authors":"Pierre Garreau , Pascal Cotret , Julien Francq , Jean-Christophe Cexus , Loïc Lagadec","doi":"10.1016/j.sysarc.2025.103501","DOIUrl":"10.1016/j.sysarc.2025.103501","url":null,"abstract":"<div><div>This survey investigates recent developments in versatile embedded Machine Learning (ML) hardware acceleration. Various architectural approaches for efficient implementation of ML algorithms on resource-constrained devices are analyzed, focusing on three key aspects: performance optimization, embedded system considerations (throughput, latency, energy efficiency) and multi-application support. The survey does not, however, cover attacks on and defenses of the ML architectures themselves. The survey then explores different hardware acceleration strategies, from custom RISC-V instructions to specialized Processing Elements (PEs), Processing-in-Memory (PiM) architectures and co-design approaches. Notable innovations include flexible bit-precision support, reconfigurable PEs, and memory management techniques that reduce the movement overhead of weights and (hyper-)parameters. Subsequently, these architectures are evaluated based on the aforementioned key aspects. 
Our analysis shows that effective and robust embedded ML acceleration requires careful consideration of the trade-offs between computational capability, power consumption, and architecture flexibility, depending on the application.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103501"},"PeriodicalIF":3.7,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144470141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}