Title: A robust fault-tolerant framework for VM failure predication and efficient task scheduling in dynamic cloud environments
Authors: S. Sheeja Rani, Oruba Alfawaz, Ahmed M. Khedr
Journal: Journal of Network and Computer Applications, volume 244, Article 104340
Published: 2025-09-26
DOI: 10.1016/j.jnca.2025.104340
URL: https://www.sciencedirect.com/science/article/pii/S1084804525002371
Impact factor: 8.0 (JCR Q1, Computer Science, Hardware & Architecture)
Citations: 0
Abstract
Because cloud computing is dynamic, fault tolerance is essential to the reliability and performance of virtualized environments. Failures in Virtual Machines (VMs) disrupt the seamless operation of cloud-based services, which makes a robust failure prediction system vital. As a solution, this work proposes Segmented Regressive Learning-based Multivariate Raindrop Optimized Lottery Scheduling (SRL-MROLS) for dynamic cloud environments. First, VM failure prediction is carried out using a Segmented Regressive Q-learning algorithm that takes a set of VMs as input. Segmented regression analyzes the average failure rate of VMs, while a reward-based framework guides the decision-making process for accurate failure prediction. Once a failure is predicted, a relocation process is triggered that migrates workloads or tasks from the failing VM to an alternate VM. Next, a Multivariate Elitism Raindrop Optimization approach identifies the optimal VM for task migration. Finally, Deadline-Aware Stochastic Prioritized Lottery Scheduling allocates tasks efficiently to the selected VMs, maintaining seamless operation even when VMs fail. This process improves task scheduling by maximizing throughput and minimizing response time in cloud environments. Experimental results show that SRL-MROLS outperforms conventional techniques across several metrics: on average, it improves failure prediction accuracy by 6.4% and throughput by 27.4%, and reduces response time by 13%, failure prediction time by 15%, migration cost by 14.3%, and makespan by 15%.
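The abstract does not say how the Deadline-Aware Stochastic Prioritized Lottery Scheduling step assigns priorities, so the following is only a minimal sketch of generic lottery scheduling with a deadline-based ticket weighting, not the authors' algorithm. The `Task` class, the `ticket_count` weighting (tickets inversely proportional to remaining slack), and the `base_tickets` parameter are all assumptions introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    deadline: float  # absolute deadline in seconds (hypothetical unit)


def ticket_count(task, now, base_tickets=100):
    """Assumed weighting: the closer the deadline, the more tickets."""
    slack = max(task.deadline - now, 1e-9)  # guard against overdue tasks
    return max(1, int(base_tickets / slack))


def lottery_pick(tasks, now, rng):
    """Draw one task with probability proportional to its ticket count."""
    tickets = [ticket_count(t, now) for t in tasks]
    winner = rng.randrange(sum(tickets))
    for task, count in zip(tasks, tickets):
        if winner < count:
            return task
        winner -= count
    raise RuntimeError("unreachable: winner is always below the ticket total")


def schedule(tasks, now=0.0, seed=0):
    """Build an execution order by holding a lottery until no task is left."""
    rng = random.Random(seed)
    pending = list(tasks)
    order = []
    while pending:
        chosen = lottery_pick(pending, now, rng)
        pending.remove(chosen)
        order.append(chosen.name)
    return order


if __name__ == "__main__":
    demo = [Task("backup", 500.0), Task("render", 20.0), Task("alert", 2.0)]
    print(schedule(demo))
```

Because the draw is random, tasks with distant deadlines still get an occasional turn, which is the usual argument for lottery scheduling over strict earliest-deadline ordering. A fixed `seed` makes the order reproducible for testing.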
About the journal:
The Journal of Network and Computer Applications welcomes research contributions, surveys, and notes in all areas relating to computer networks and applications thereof. Sample topics include new design techniques, interesting or novel applications, components or standards; computer networks with tools such as WWW; emerging standards for internet protocols; Wireless networks; Mobile Computing; emerging computing models such as cloud computing, grid computing; applications of networked systems for remote collaboration and telemedicine, etc. The journal is abstracted and indexed in Scopus, Engineering Index, Web of Science, Science Citation Index Expanded and INSPEC.