"Reduced and mixed precision turbulent flow simulations using explicit finite difference schemes"
Bálint Siklósi, Pushpender K. Sharma, David J. Lusher, István Z. Reguly, Neil D. Sandham
DOI: 10.1016/j.future.2025.108111. Future Generation Computer Systems, Volume 175, Article 108111, published 2025-09-08.
Abstract: The use of reduced and mixed precision computing has gained increasing attention in high-performance computing (HPC) as a means to improve computational efficiency, particularly on modern hardware architectures like GPUs. In this work, we explore the application of mixed precision arithmetic in compressible turbulent flow simulations using explicit finite difference schemes. We extend the OPS and OpenSBLI frameworks to support customizable precision levels, enabling fine-grained control over precision allocation for different computational tasks. Through a series of numerical experiments on the Taylor-Green vortex benchmark, we demonstrate that mixed precision strategies, such as half-single and single-double combinations, can offer significant performance gains without compromising numerical accuracy. However, pure half-precision computations result in unacceptable accuracy loss, underscoring the need for careful precision selection. Our results show that mixed precision configurations can reduce memory usage and communication overhead, leading to notable speedups, particularly on multi-CPU and multi-GPU systems.
{"title":"Management of autoscaling serverless functions in edge computing via Q-Learning","authors":"Priscilla Benedetti , Mauro Femminella , Gianluca Reali","doi":"10.1016/j.future.2025.108112","DOIUrl":"10.1016/j.future.2025.108112","url":null,"abstract":"<div><div>Serverless computing is a recently introduced deployment model to provide cloud services. The autoscaling of function instances allows adapting allocated resources to workload, so as to reduce latency and improve resource usage efficiency. However, autoscaling mechanisms could be affected by undesired ‘cold starts’ events, causing latency peaks due to spawning of new instances, which can be critical in edge deployments where applications are typically sensitive to latency. In order to regulate autoscaling of functions and mitigate the latency for accessing services, which may hinder the adoption of the serverless model in edge computing, we resort to the usage of reinforcement learning. Our experimental system is based on OpenFaaS, the most popular open-source Kubernetes-based serverless platform. In this system, we introduce a Q-Learning (QL) agent to dynamically configure the Kubernetes Horizontal Pod Autoscaler (HPA). This is accomplished via a QL model state space and a reward function definition that enforce service level agreement (SLA) compliance, in terms of latency, without allocating excessive resources. The agent is trained and tested using real serverless function invocation patterns, made available by Microsoft Azure. The experimental results show the benefits provided by the proposed solution over state-of-the-art in terms of compliance to the SLA, while limiting resource consumption and service request losses.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108112"},"PeriodicalIF":6.2,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Def-Ag: An energy-efficient decentralized federated learning framework via aggregator clients","authors":"Junyoung Park , Sungpil Woo , Joohyung Lee","doi":"10.1016/j.future.2025.108114","DOIUrl":"10.1016/j.future.2025.108114","url":null,"abstract":"<div><div>Federated Learning (FL) has revolutionized Artificial Intelligence (AI) by enabling decentralized model training across diverse datasets, thereby addressing privacy concerns. However, traditional FL relies on a centralized server, leading to latency, single-point failures, and trust issues. Decentralized Federated Learning (DFL) emerges as a promising solution, but it faces challenges in achieving optimal accuracy and convergence due to limited client interactions, requiring energy inefficiency. Moreover, balancing the personalization and generalization of the AI model in DFL remains a complex issue. To address those challenging problems, this paper presents Def-Ag, an innovative energy-efficient DFL framework utilizing aggregator clients within similarity-based clusters. To reduce this signaling overhead, a partial model information exchange is proposed in intra-cluster training. In addition, the knowledge distillation method is applied for inter-cluster training to carefully incorporate the knowledge between clusters. Finally, by integrating clustering-based hierarchical DFL and optimizing client selection, Def-Ag reduces energy consumption and communication overhead while balancing personalization and generalization. Extensive experiments on CIFAR-10 and FMNIST datasets confirm Def-Ag’s superior performance in reducing energy usage and maintaining learning accuracy compared to baseline methods. The results demonstrate that Def-Ag effectively balances personalization and generalization, providing a robust solution for energy-efficient decentralized federated learning systems.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108114"},"PeriodicalIF":6.2,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient two-stage computing method for large-scale research interest mining","authors":"Sha Yuan , Zhou Shao","doi":"10.1016/j.future.2025.108117","DOIUrl":"10.1016/j.future.2025.108117","url":null,"abstract":"<div><div>Semantic analysis for academic data is crucial for many scientific services, such as review recommendation, planning research funding directions. Research interest analysis faces challenges in large-scale academic data mining. Traditional methods of representing research interests, such as manual labeling, using statistical or machine learning methods, have limitations. In particular, the computation amount is unacceptable in large-scale multisource information integration. This paper presents an efficient computing method for predicting scholar interests based on the principle of large-scale recommendation systems, consisting of rough and refined sorting. In rough sorting, one-hot encoding, CHI square feature selection, TF-IDF feature extraction, and an SGD-based classifier are used to obtain several top interest labels. In refined sorting, a pre-trained SciBERT model outputs the optimal interest labels. The proposed approach offers two main advantages. Firstly, it improves computational efficiency, as directly using pre-trained models like BERT for large-scale data leads to excessive calculations. Secondly, the algorithm ensures better model performance. Feature selection in the rough sorting stage can avoid the negative impact of irrelevant papers on prediction precision, which is a problem when using pre-trained model directly.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108117"},"PeriodicalIF":6.2,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145057285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A fine-grained task scheduling strategy for resource auto-scaling over fluctuating data streams"
Yinuo Fan, Dawei Sun, Minghui Wu, Shang Gao, Rajkumar Buyya
DOI: 10.1016/j.future.2025.108119. Future Generation Computer Systems, Volume 175, Article 108119, published 2025-09-06.
Abstract: Resource scaling is crucial for stream computing systems in fluctuating data stream scenarios. Computational resource utilization fluctuates significantly with changes in data stream rates, often leading to pronounced resource surplus and scarcity within these systems. Existing research has primarily focused on addressing resource insufficiency at runtime, but effective solutions for handling variable data streams remain limited. Furthermore, overlooking task communication dependencies when placing tasks during resource adjustment may increase communication cost and consequently impair system performance. To address these challenges, we propose Ra-Stream, a fine-grained task scheduling strategy for resource auto-scaling over fluctuating data streams. Ra-Stream not only dynamically adjusts resources to accommodate varying data streams, but also employs fine-grained scheduling to further optimize system performance. This paper explains Ra-Stream through the following aspects. (1) Formalization: we formalize the application subgraph partitioning, resource scaling, and task scheduling problems by constructing and analyzing a stream application model, a communication model, and a resource model. (2) Resource scaling and heuristic partitioning: we propose a resource scaling algorithm that scales computational resources to adapt to fluctuating data streams, together with a heuristic subgraph partitioning algorithm that minimizes communication cost while keeping partitions balanced. (3) Fine-grained task scheduling: we present a fine-grained task scheduling algorithm that minimizes computational resource utilization while reducing communication cost through thread-level task deployment. (4) Comprehensive evaluation: we evaluate multiple metrics, including latency, throughput, and resource utilization, in a real-world distributed stream computing environment. Experimental results demonstrate that, compared to state-of-the-art approaches, Ra-Stream reduces system latency by 36.37% to 47.45%, increases maximum system throughput by 26.2% to 60.55%, and saves 40% to 46.25% in resource utilization.
"SEMQ: Efficient non-uniform quantization with sensitivity-based error minimization for large language models"
Dongmin Li, Xiurui Xie, Dongyang Zhang, Athanasios V. Vasilakos, Man-Fai Leung
DOI: 10.1016/j.future.2025.108120. Future Generation Computer Systems, Volume 175, Article 108120, published 2025-09-05.
Abstract: Large Language Models (LLMs) represent a pivotal breakthrough in computational intelligence, showcasing exceptional capabilities in information aggregation and reasoning. However, their remarkable performance comes at the cost of ultra-high-scale parameters, leading to significant resource demands during deployment. Various model compression techniques have therefore been developed, such as pruning, distillation, and quantization. Among these, quantization has gained prominence due to its ability to directly reduce the precision of model weights and activations, resulting in substantial memory savings and accelerated inference. Despite its advantages, existing quantization approaches face substantial challenges at ultra-low precisions (e.g., 2-bit), often resulting in severe performance degradation. To tackle this challenge, we propose a novel non-uniform quantization scheme with minimal disturbance for LLMs, which contains two innovations: (i) a Sensitivity-based Error Minimization Non-Uniform Quantization (SEMQ) algorithm, which finds a quantization scheme that minimizes the quantization error through continuous iteration; and (ii) a Z-score-based method for outlier detection and isolation under a normal distribution assumption, reducing the complexity of the quantization process. Extensive experiments on the LLaMA family demonstrate that the proposed SEMQ enables ultra-low precision quantization down to 2-bit and a 10× reduction in GPU memory relative to the original LLMs while maintaining model accuracy. Our code is publicly available at https://github.com/ldm2060/semq.
"Adaptive-oriented mutation snake optimizer for scheduling budget-constrained workflows in heterogeneous cloud environments"
Yanfen Zhang, Longxin Zhang, Buqing Cao, Jing Liu, Wenyu Zhao, Jianguo Chen, Keqin Li
DOI: 10.1016/j.future.2025.108118. Future Generation Computer Systems, Volume 175, Article 108118, published 2025-09-05.
Abstract: Cloud computing, recognized as an advanced computing paradigm, facilitates flexible and efficient resource management and service delivery through virtualization and resource sharing. However, the computational capabilities of resources in heterogeneous cloud environments are often correlated with their costs, so budget constraints are imposed on users who require rapid response times. To tackle this challenge, we introduce the snake optimizer (SO), a metaheuristic optimization algorithm, for workflow scheduling in cloud environments. We integrate random mutation to enhance the algorithm's global search capability and overcome SO's tendency to converge to local optima. Additionally, to increase the success rate of finding feasible solutions within budget constraints, we implement a directional strategy to guide the evolutionary paths of the snake individuals. In this context, excessive randomness and overly rigid directionality can both adversely affect search performance, so we propose an adaptive-oriented mutation (AOM) mechanism to balance the two. This AOM mechanism is integrated with SO to create AOM-SO, which effectively addresses the makespan minimization problem for workflow scheduling under budget constraints in heterogeneous cloud environments. Comparative experiments using real-world scientific workflows show that AOM-SO achieves a 100% success rate in identifying feasible solutions. Moreover, compared with state-of-the-art algorithms, it reduces makespan by an average of 43.03%.
"Dynamic spatio-temporal graph interaction attention network for traffic flow prediction"
Wenshu Li, Jianhang Fei, Yongbing Jiang, Xiaoying Guo, Xiulin Geng, Xiaoyu He
DOI: 10.1016/j.future.2025.108116. Future Generation Computer Systems, Volume 175, Article 108116, published 2025-09-04.
Abstract: Against the backdrop of rapid urbanization, traffic flow prediction has become pivotal for urban transportation management and road planning. However, traffic data exhibits complex spatio-temporal dependencies, including long-term periodic trends and abrupt short-term fluctuations. Moreover, traffic patterns differ markedly across regions due to variations in geographic topology and the dynamic nature of inter-node interactions. To address these challenges, we propose a traffic flow prediction model based on a dynamic spatio-temporal graph interaction attention network (DynSTGIA). The model integrates a Time Fusion Attention (TFA) module to jointly capture localized short-term fluctuations and global long-term temporal dependencies, while a Memory-Guided Spatio-temporal Graph Module (MG-STM) combines a learnable memory with multi-head attention to adaptively generate dynamic graphs and capture evolving spatial correlations. Moreover, to overcome the modality separation of traditional spatio-temporal models and enhance spatio-temporal fusion, we introduce an interaction learning mechanism that enables deep integration of temporal and spatial representations. Extensive experiments on five real-world traffic datasets demonstrate that DynSTGIA achieves improvements of up to 2.1% in MAE and 9.8% in RMSE over strong baselines, confirming its superior performance across diverse traffic scenarios.
{"title":"Benchmarking a DNN for aortic valve calcium lesions segmentation on FPGA-based DPU using the vitis AI toolchain","authors":"Valentina Sisini , Andrea Miola , Giada Minghini , Enrico Calore , Armando Ugo Cavallo , Sebastiano Fabio Schifano , Cristian Zambelli","doi":"10.1016/j.future.2025.108115","DOIUrl":"10.1016/j.future.2025.108115","url":null,"abstract":"<div><div>Semantic segmentation assigns a class to every pixel of an image to automatically locate objects in the context of computer vision applications for autonomous vehicles, robotics, agriculture, gaming, and medical imaging. Deep Neural Network models, such as Convolutional Neural Networks (CNNs), are widely used for this purpose. Among the plethora of models, the U-Net is a standard in biomedical imaging. Nowadays, GPUs efficiently perform segmentation and are the reference architectures for running CNNs, and FPGAs compete for inferences among alternative platforms, promising higher energy efficiency and lower latency solutions. In this contribution, we evaluate the performance of FPGA-based Deep Processing Units (DPUs) implemented on the AMD Alveo U55C for the inference task, using calcium segmentation in cardiac aortic valve computer tomography scans as a benchmark. We design and implement a U-Net-based application, optimize the hyperparameters to maximize the prediction accuracy, perform pruning to simplify the model, and use different numerical quantizations to exploit low-precision operations supported by the DPUs and GPUs to boost the computation time. We describe how to port and deploy the U-Net model on DPUs, and we compare accuracy, throughput, and energy efficiency achieved with four generations of GPUs and a recent dual 32-core high-end CPU platform. Our results show that a complex DNN like the U-Net can run effectively on DPUs using 8-bit integer computation, achieving a prediction accuracy of approximately <span><math><mrow><mn>95</mn><mspace></mspace><mo>%</mo></mrow></math></span> in Dice and <span><math><mrow><mn>91</mn><mspace></mspace><mo>%</mo></mrow></math></span> in IoU scores. These results are comparable to those measured when running the floating-point models on GPUs and CPUs. On the one hand, in terms of computing performance, the DPUs achieves a inference latency of approximately 3.5 ms and a throughput of approximately 4.2 kPFS, boosting the performance of a 64-core CPU system by approximately <span><math><mrow><mn>10</mn><mspace></mspace><mo>%</mo></mrow></math></span> in terms of latency and a factor <span><math><mrow><mn>2</mn><mi>X</mi></mrow></math></span> in terms of throughput, but still do not overcoming the performance of GPUs when using the same numerical precision. 
On the other hand, considering the energy efficiency, the improvements are approximately a factor <span><math><mrow><mn>6.7</mn><mi>X</mi></mrow></math></span> compared to the CPU, and <span><math><mrow><mn>1.6</mn><mi>X</mi></mrow></math></span> compared to the P100 GPU manufactured with the same technological process (16 nm).</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108115"},"PeriodicalIF":6.2,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
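For reference, the Dice and IoU scores reported above are the standard overlap metrics for binary segmentation masks; a small NumPy sketch of their computation is given below. The epsilon guard for empty masks and the toy lesion masks are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Dice and IoU for binary masks (arrays of 0/1), as commonly defined;
    a small epsilon avoids division by zero on empty masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return float(dice), float(iou)

# Toy masks: a predicted calcium lesion partially overlapping the ground truth.
pred = np.zeros((8, 8), dtype=np.uint8);  pred[2:6, 2:6] = 1
gt = np.zeros((8, 8), dtype=np.uint8);    gt[3:7, 3:7] = 1
print(dice_and_iou(pred, gt))   # roughly Dice 0.56, IoU 0.39
```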
{"title":"Optimal repair and load balance in locally repairable codes: Design and evaluation","authors":"Ximeng Chen , Si Wu , Hao Zhao , Yinlong Xu","doi":"10.1016/j.future.2025.108113","DOIUrl":"10.1016/j.future.2025.108113","url":null,"abstract":"<div><div>Erasure coding is increasingly deployed in modern clustered storage systems to provide low-cost reliable storage. In particular, Locally Repairable Codes (LRCs) are a popular family of repair-efficient erasure codes that receive wide deployment in practice. In this paper, we analyze the storage process formulated as a data partitioning phase plus a node selection phase for LRCs in clustered storage systems. We show that the conventional flat partitioning and random partitioning incur significant cross-cluster repair traffic, while the random node selection causes storage and network imbalance. To this end, we design a new storage scheme composed of an optimal partitioning strategy and an enhanced node selection strategy for LRCs. Our partitioning strategy minimizes the cross-cluster repair traffic by dividing each group of blocks into the minimum number of clusters and further compactly placing the blocks. Our node selection strategy improves load balance by choosing less-loaded clusters and nodes to store blocks with potential higher access frequency at higher priority. To accommodate access fluctuations, we enhance our storage scheme with a rebalancing strategy that restores storage and network balance at both the cluster and node levels. We implement our storage scheme on a key-value store prototype atop Memcached. Evaluation on a LAN testbed shows that our scheme greatly improves the repair performance and load balance ratio compared to the baseline.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"175 ","pages":"Article 108113"},"PeriodicalIF":6.2,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}