Detecting and mitigating faults in cloud computing environment
M. K. Gokhroo, M. C. Govil, E. Pilli
2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT)
Published: 2017-02-01
DOI: 10.1109/CIACT.2017.7977362 (https://doi.org/10.1109/CIACT.2017.7977362)
Citations: 18
Abstract
Distributed systems have evolved swiftly from networks of personal computers to clusters and then to grids, on to the era of cloud computing, and most recently to the Internet of Things (IoT). With these rapid advances, the scale and complexity of systems providing cloud computing services have also increased tremendously. The major challenge faced by cloud service providers today is to provide an efficient, cost-effective, and reliable solution for seamless delivery of services to users. To achieve this, the research community is working on related issues such as scheduling, power consumption, high availability, customer retention, resource provisioning, reliability, and minimizing the probability of failures. Reliability of service is an important parameter. With a large number of components in the cloud, failures are becoming the norm rather than the exception while delivering services to users. This emphasizes the need to develop fault-tolerant schemes for the cloud environment that deliver the required level of reliability. In this work, we propose a novel fault detection and mitigation approach. The novelty of the approach lies in detecting faults based on the running status of the job. The detection algorithm periodically monitors the progress of jobs on virtual machines (VMs) and reports jobs stalled by a failed VM to the fault tolerant manager (FTM). This not only reduces resource wastage but also ensures timely delivery of services, avoiding penalties for service level agreement (SLA) violations. The proposed approach is validated using the CloudSim simulator, and the performance analysis indicates that it is effective.
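The abstract does not give the details of the detection algorithm, so the following is only a minimal sketch of the general idea of progress-based stall detection, not the authors' implementation. It assumes that a job counts as stalled when its progress has not advanced for a fixed number of consecutive checks, and that mitigation means moving the job to a spare VM. The names `Job`, `FaultTolerantManager`, `ProgressMonitor`, `stall_limit`, and `spare_vms` are all hypothetical, and Python is used here for brevity even though the paper's validation uses CloudSim.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    vm_id: str
    progress: float = 0.0  # fraction of work completed, 0.0 to 1.0


@dataclass
class FaultTolerantManager:
    """Hypothetical FTM: records stalled jobs and moves them to a spare VM."""
    spare_vms: list
    reported: list = field(default_factory=list)

    def report_stalled(self, job):
        self.reported.append((job.job_id, job.vm_id))
        if self.spare_vms:
            # Assumed mitigation policy: reassign the job to the next spare VM.
            job.vm_id = self.spare_vms.pop(0)


class ProgressMonitor:
    """Flags a job once its progress is unchanged for `stall_limit` checks."""

    def __init__(self, ftm, stall_limit=2):
        self.ftm = ftm
        self.stall_limit = stall_limit  # assumed threshold, not from the paper
        self._last = {}      # job_id -> progress seen at the previous check
        self._stalled = {}   # job_id -> consecutive checks without progress

    def check(self, jobs):
        """Run one monitoring pass; call this once per monitoring period."""
        for job in jobs:
            if job.progress >= 1.0:
                continue  # finished jobs are not monitored
            previous = self._last.get(job.job_id)
            if previous is not None and job.progress <= previous:
                self._stalled[job.job_id] = self._stalled.get(job.job_id, 0) + 1
            else:
                self._stalled[job.job_id] = 0
            self._last[job.job_id] = job.progress
            if self._stalled[job.job_id] >= self.stall_limit:
                self.ftm.report_stalled(job)
                self._stalled[job.job_id] = 0  # restart the count after reporting
```

In this sketch, a scheduler or timer would call `check` once per monitoring period. Counting consecutive checks without progress, rather than reacting to a single one, is a design choice meant to avoid reporting jobs that are merely slow.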