{"title":"Inference framework supporting parallel execution across heterogeneous accelerators","authors":"Philkyue Shin , Myungsun Kim , Seongsoo Hong","doi":"10.1016/j.sysarc.2025.103508","DOIUrl":"10.1016/j.sysarc.2025.103508","url":null,"abstract":"<div><div>The growing demand for on-device deep learning inference, particularly in latency-sensitive applications, has driven the adoption of heterogeneous accelerators that incorporate GPUs, DSPs, and NPUs. While these accelerators offer improved energy efficiency, their heterogeneity introduces significant programming complexity due to reliance on vendor-specific APIs. Existing deep learning inference frameworks, such as LiteRT, provide high-level APIs and support multiple backend APIs. However, they lack the ability to exploit parallel execution across heterogeneous accelerators. This paper introduces a novel inference framework that overcomes this limitation. Our framework utilizes a batch inference API to enable parallel execution across heterogeneous accelerators. The framework schedules heterogeneous accelerators to process batched inputs concurrently. To address the challenge of integrating commercial NPU APIs that do not support LiteRT, we develop a portable hooking engine. Furthermore, the framework mitigates accuracy inconsistencies arising from diverse quantization methods by dynamically adjusting postprocessing parameters to balance accuracy and latency. The proposed framework minimizes both average turnaround time and postprocessing time. 
Experimental results on a Qualcomm SA8195 SoC with Mobilint and Hailo NPUs demonstrate significant performance improvements compared to existing inference frameworks.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103508"},"PeriodicalIF":3.7,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting intra-chip locality for multi-chip GPUs via two-level shared L1 cache","authors":"Xiangrong Xu , Liang Wang , Limin Xiao , Lei Liu , Zihao Zhou , Yuanqiu Lv , Li Ruan , Xilong Xie , Meng Han , Xiaojian Liao","doi":"10.1016/j.sysarc.2025.103500","DOIUrl":"10.1016/j.sysarc.2025.103500","url":null,"abstract":"<div><div>Remote memory accesses in multi-chip GPUs pose a major performance bottleneck due to high latency and inter-chip bandwidth contention. Exploiting intra-chip locality alleviates this bottleneck by serving memory accesses locally and reducing cross-chip traffic. Yet, conventional coarse-grained approaches to exploiting locality in multi-chip GPUs often incur excessive overhead, limiting their potential performance benefits. To this end, we propose TLS-Cache, a two-level shared L1 cache that efficiently exploits intra-chip locality without additional cache capacity. It mitigates high-latency remote memory accesses by enabling fine-grained data reuse through cluster-shared and remote-shared L1 caches, which capture locality within and across streaming multiprocessor clusters, respectively. These two caches work cooperatively to maximize the exploitation of intra-chip locality and deliver measurable performance gains. 
Experimental results show that TLS-Cache improves instructions per cycle by 30.2% on average, compared with the baseline 4-chip GPU with private L1 caches.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103500"},"PeriodicalIF":3.7,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZoomDB: Building cost-effective key–value store engine on ZNS SSD and SMR HDD","authors":"Shiqiang Nie , Chi Zhang , Menghan Li , Fangxing Yu , Yaming Li , Weiguo Wu","doi":"10.1016/j.sysarc.2025.103465","DOIUrl":"10.1016/j.sysarc.2025.103465","url":null,"abstract":"<div><div>Log-Structured Merge-tree (LSM-tree)-based key–value (KV) stores have become critical components for managing data in write-intensive cloud applications. With the explosive growth of unstructured data, emerging host-managed zoned storage devices, such as high-performance Zoned NameSpace Solid State Drives (ZNS SSDs) and large-capacity Shingled Magnetic Recording Hard Disk Drives (SMR HDDs), present an ideal opportunity for efficient data storage. However, the state-of-the-art scheme partitions the LSM-tree across hybrid storage, placing lower levels on high-performance devices and higher levels on large-capacity devices, but it fails to address the challenges of data layout and garbage collection (GC) in a hybrid storage system equipped with ZNS SSDs and SMR HDDs.</div><div>In this paper, we propose ZoomDB, an LSM-tree KV store engine designed around KV separation and tailored for hybrid zoned storage devices. First, we integrate KV separation with zone management in LSM-tree-based hybrid storage. Specifically, keys and low-level values are placed in high-performance zones on ZNS SSDs, while high-level values are stored in large-capacity zones on SMR HDDs, optimizing both performance and storage efficiency. To further enhance data management, we introduce a hotness identification mechanism that classifies values by access frequency and stores hot and cold values in separate zones. Finally, we propose diversity GC, a garbage collection policy tailored to zones with different access frequencies, which effectively reduces data migration overhead. We implement and evaluate ZoomDB on a real ZNS SSD and SMR HDD. 
The evaluation results demonstrate that ZoomDB reduces the number of GC-triggered writes by 77.5% on average compared to WiscKey. It achieves throughput gains of 1.79<span><math><mo>×</mo></math></span> , 3.13<span><math><mo>×</mo></math></span> , 4.01<span><math><mo>×</mo></math></span> , 4.25<span><math><mo>×</mo></math></span> , and 4.32<span><math><mo>×</mo></math></span> over WiscKey+, WiscKey, GearDB, ZoneKV, and LevelDB, respectively.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103465"},"PeriodicalIF":3.7,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LB-CLBS: Lattice-based certificateless blind signature scheme for vehicle sensing within intelligent transportation","authors":"Sheng-wei Xu , Shu-han Yu , Wan-Lu Liu , Zi-Yan Yue , Yi-Long Liu","doi":"10.1016/j.sysarc.2025.103491","DOIUrl":"10.1016/j.sysarc.2025.103491","url":null,"abstract":"<div><div>In intelligent transportation, sensors installed on vehicles collect road information and other sensing data, enabling various intelligent services for the relevant management departments. Government administrations use these data to provide convenient services to vehicle users and to promote the development of intelligent transportation. However, as the importance of data continues to grow, threats to the privacy of sensing data have increased dramatically. Malicious attackers can illegally obtain sensitive information about a vehicle, including its speed, location, behavioral preferences, and other data. Furthermore, the rise of quantum computing poses an additional threat to the privacy of vehicle data. Therefore, in this paper, we propose a new lattice-based certificateless blind signature (LB-CLBS) scheme built on module lattices to enhance vehicle privacy protection in intelligent transportation environments. Concretely, we use certificateless cryptography to construct a blind signature scheme based on the basic framework of Dilithium, which both makes the scheme post-quantum secure and solves the key escrow problem of traditional cryptosystems. Based on the hardness of the Module Short Integer Solution (MSIS) and Module Learning With Errors (MLWE) problems, we prove that the LB-CLBS scheme is existentially unforgeable under adaptive chosen-message attacks (EUF-CMA) in the random oracle model. The performance evaluation shows that our scheme offers advantages over previous schemes in all security properties. 
In addition, the computational efficiency of our scheme is improved by at least 70% compared with the previous schemes.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103491"},"PeriodicalIF":3.7,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144321802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SECP-AKE: Secure and efficient certificateless-password-based authenticated key exchange protocol for smart healthcare systems","authors":"Zhiqiang Zhao , Xuexian Hu , Yining Liu , Jianghong Wei , Yuanjun Xia , Yangfan Liang","doi":"10.1016/j.sysarc.2025.103485","DOIUrl":"10.1016/j.sysarc.2025.103485","url":null,"abstract":"<div><div>Due to the importance and sensitivity of medical data, security protection and privacy preservation in the Healthcare Internet of Things (IoT) are current research hotspots. However, existing schemes still suffer from incomplete security properties, imperfect authentication mechanisms, and inadequate privacy preservation. Therefore, this paper presents SECP-AKE, a secure and efficient certificateless-password-based authenticated key exchange protocol for IoT-based smart healthcare, which enables batch authentication, resists physical attacks, and provides strong anonymity. Specifically, using certificateless cryptography, the SECP-AKE protocol enables batch authentication of authorized users and devices while also resolving the key escrow problem. In particular, the SECP-AKE protocol incorporates Physical Unclonable Functions (PUFs) to resist physical attacks, thus enhancing device security and ensuring reliable medical service delivery. Additionally, a pseudonym update mechanism achieves user unlinkability, thereby strengthening privacy preservation. The results of both formal verification using SVO logic and informal security analysis demonstrate that the SECP-AKE protocol is secure and offers more comprehensive security properties. Meanwhile, the well-known automated security verification tool Scyther is used to further evaluate the protocol’s security. 
Ultimately, comparative experiments on communication overhead and computational overhead demonstrate that the SECP-AKE protocol is efficient and feasible compared to state-of-the-art existing works.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103485"},"PeriodicalIF":3.7,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144335670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is it worth the energy? An in-depth study on the energy efficiency of data augmentation strategies for finetuning-based low/few-shot object detection","authors":"Vladislav Li , Georgios Tsoumplekas , Ilias Siniosoglou , Panagiotis Sarigiannidis , Vasileios Argyriou","doi":"10.1016/j.sysarc.2025.103484","DOIUrl":"10.1016/j.sysarc.2025.103484","url":null,"abstract":"<div><div>Current methods for low- and few-shot object detection have primarily focused on improving detection performance. One common approach is to combine model finetuning with data augmentation strategies. However, little attention has been given to the energy efficiency of these approaches in data-scarce regimes. This paper presents a comprehensive empirical study that examines both the model performance and the energy efficiency of custom data augmentations and automated data augmentation selection strategies when combined with a lightweight object detector. The methods are evaluated on four benchmark datasets in terms of performance and energy consumption, providing insights into reaching an optimal tradeoff between these two objectives. Additionally, to better quantify this tradeoff, we propose a novel metric, the modified Efficiency Factor, which combines these two conflicting objectives into a single metric and thus provides insight into the effectiveness of the examined models and data augmentation strategies when both performance and efficiency are considered. 
Consequently, it is shown that while some broader guidelines regarding appropriate data augmentation selections can be provided based on the obtained performance and energy efficiency results, in many cases, the performance gains of data augmentation strategies are overshadowed by their increased energy usage, necessitating the development of more energy-efficient data augmentation strategies to address data scarcity.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103484"},"PeriodicalIF":3.7,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144321989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight and anonymous certificateless signcryption scheme for multi-receiver","authors":"Qingqing Xie, Liangqing Song","doi":"10.1016/j.sysarc.2025.103482","DOIUrl":"10.1016/j.sysarc.2025.103482","url":null,"abstract":"<div><div>This paper proposes an innovative certificateless signcryption scheme that achieves lightweight computation and anonymity for both the sender and the receiver. By replacing bilinear pairing operations with elliptic curve scalar multiplication, the proposed scheme significantly reduces computational overhead, making it suitable for resource-limited devices. Furthermore, the scheme provides anonymity for both sender and receiver by embedding the sender’s real identity within a set of disguised identities and concealing the receiver’s identity through pseudonyms. It also supports multiple receivers. It achieves a signcryption time of 1.134 ms, an unsigncryption time of 0.542 ms, and a ciphertext size of 280 bytes. Compared with existing pairing-free schemes that achieve sender or receiver anonymity, the costs of signcryption and unsigncryption are reduced by up to 50% and 86%, respectively. Through a formal security proof, we demonstrate that the proposed scheme ensures confidentiality and unforgeability.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103482"},"PeriodicalIF":3.7,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144321801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comprehensive review on hardware implementations of lattice-based cryptographic schemes","authors":"Shaik Ahmadunnisa, Sudha Ellison Mathe","doi":"10.1016/j.sysarc.2025.103486","DOIUrl":"10.1016/j.sysarc.2025.103486","url":null,"abstract":"<div><div>The growing threat from large-scale quantum computers has driven significant advances in Post-Quantum Cryptography (PQC). In this context, the National Institute of Standards and Technology (NIST) has initiated a call to standardize PQC schemes. Among PQC schemes, lattice-based cryptography (LBC) is considered one of the most viable owing to its robust security proofs and ease of implementation. In this paper, we survey the mathematical hardness of lattice-based schemes and provide a comprehensive review of existing hardware implementations of LBC schemes. We also review the hardware optimization techniques used in existing designs. Finally, we outline research directions toward efficient and secure cryptosystems.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103486"},"PeriodicalIF":3.7,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144321803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronous VS asynchronous reconfiguration of Memory Bandwidth Management Schemes: A comparative analysis","authors":"Andrea Serafini , Alessandro Biasci , Bruno Morelli , Paolo Valente , Andrea Marongiu","doi":"10.1016/j.sysarc.2025.103483","DOIUrl":"10.1016/j.sysarc.2025.103483","url":null,"abstract":"<div><div>Memory bandwidth contention may severely inflate the execution time of tasks co-running on modern Commercial Off-The-Shelf (COTS) multicores. An effective and widely deployed solution to mitigate the problem is <em>bandwidth regulation</em>, a technique that limits the memory bandwidth available to cores that are not executing time-critical <em>tasks</em>. In fact, the granularity at which time-critical activities can be identified at the core level can be even finer than a whole task, covering only the smaller <em>memory-critical sections</em> (MCS) within it. As the co-presence of MCS and non-critical task portions in the system changes dynamically over time, <em>bandwidth regulators</em> require more or less frequent <em>reconfiguration</em> of their parameters. Such <em>reconfiguration techniques</em> thus represent a central component of dynamic <em>Memory Bandwidth Management Schemes</em> (MBMS). In particular, the overhead and latency of this component determine the feasibility and control granularity of the overall bandwidth-regulation solution. The literature extensively covers low-level bandwidth regulation mechanisms and – to some extent – their integration into wider MBMSs, yet no in-depth analysis of the impact of <em>reconfiguration techniques</em> is currently available. This paper addresses this gap by presenting a comparative analysis of the two basic approaches to <em>reconfiguring</em> bandwidth regulators in a system: <em>synchronous</em> and <em>asynchronous</em> schemes. 
The analysis, performed on a real-world setup with both synthetic and real-world benchmarks, shows that the asynchronous technique improves the control granularity of a bandwidth regulator by a factor of up to 19x, moving from the <em>ms</em> to the <span><math><mi>μ</mi></math></span><em>s</em> scale.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103483"},"PeriodicalIF":3.7,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144313789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A secure data sharing scheme based on searchable public key encryption for authorized multi-receiver","authors":"Huan Li , Lunzhi Deng , Yating Gu , Na Wang , Yanli Chen , Siwei Li","doi":"10.1016/j.sysarc.2025.103489","DOIUrl":"10.1016/j.sysarc.2025.103489","url":null,"abstract":"<div><div>To ensure data confidentiality, data sharers usually encrypt their data before uploading it to cloud storage. Data sharing is an important way to realize the value of data. Therefore, how to share encrypted data stored in the cloud among authorized users is a pressing issue. Public key encryption with keyword search provides an effective solution to this problem. In this paper, we first analyze the scheme of Yang et al. (2023) and point out that it does not achieve the indistinguishability of ciphertext and trapdoor. Then, we propose a new data sharing scheme with searchable public key encryption for authorized multi-receiver (SDS-SPKE), which supports not only keyword search but also key update and user revocation. Additionally, we provide security proofs showing that our scheme achieves the indistinguishability of ciphertext and trapdoor and solves the single-key-exposure problem. 
Finally, we compare the performance of SDS-SPKE with five other searchable encryption schemes, and the experimental results show that our scheme offers superior efficiency.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103489"},"PeriodicalIF":3.7,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144307666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}