Minimizing Overheads of Checkpoints in Distributed Stream Processing Systems
Syed Muhammad Abrar Akber, Hanhua Chen, Yonghui Wang, Hai Jin
2018 IEEE 7th International Conference on Cloud Networking (CloudNet), published 2018-10-01
DOI: 10.1109/CloudNet.2018.8549548 (https://doi.org/10.1109/CloudNet.2018.8549548)
Citations: 9
Abstract
Failures in large-scale systems are inevitable, which makes resilience a key challenge for modern systems. Checkpointing with rollback recovery is a well-known approach to providing fault tolerance in distributed systems. The checkpoint-based approach periodically persists the application state to reliable storage, and the persisted state serves as a recovery point in case of failure. Such periodic checkpoints are not in line with the failure rate of real systems, since many studies conclude that failures do not occur periodically. The choice of checkpoint interval is crucial because it directly determines the checkpoint overhead. To minimize this overhead, we propose reducing the number of checkpoints taken during application execution by successively increasing the checkpoint interval. The proposed approach considers the failure probability of the underlying infrastructure and tailors checkpoint initiation to it: if the failure probability is low, it increases the checkpoint interval, which reduces the total number of checkpoints triggered over the application's lifetime and, in turn, the checkpoint overhead. Experimental results show that the proposed checkpoint policy considerably reduces checkpoint overhead compared with periodic checkpoints.
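
The abstract does not give the exact interval-growth rule, so the following is only a minimal sketch of the general idea, not the authors' policy. It assumes a hypothetical `failure_prob` callback that estimates the failure probability at a given time, a hypothetical `threshold` below which the probability counts as "low", a multiplicative `growth` factor for the interval, and a reset to the base interval whenever the probability is not low. All of these names and values are illustrative assumptions.

```python
from typing import Callable, List


def checkpoint_times(
    run_length: float,
    base_interval: float,
    failure_prob: Callable[[float], float],
    threshold: float = 0.1,   # assumed cutoff for "low" failure probability
    growth: float = 2.0,      # assumed multiplicative growth of the interval
) -> List[float]:
    """Return the times at which checkpoints are triggered during a run.

    While the estimated failure probability stays below `threshold`, the
    interval to the next checkpoint grows by `growth`. Otherwise it falls
    back to `base_interval` (an assumption of this sketch).
    """
    times: List[float] = []
    interval = base_interval
    t = interval
    while t < run_length:
        times.append(t)
        if failure_prob(t) < threshold:
            interval *= growth          # low risk: wait longer next time
        else:
            interval = base_interval    # elevated risk: checkpoint at the base rate
        t += interval
    return times


# Adaptive schedule under a constantly low failure probability.
adaptive = checkpoint_times(100.0, 10.0, lambda t: 0.0)
# Periodic baseline: the probability never counts as low, so the interval never grows.
periodic = checkpoint_times(100.0, 10.0, lambda t: 1.0)
print(len(adaptive), len(periodic))
```

Under these assumed parameters, a 100-unit run with a base interval of 10 triggers 3 checkpoints with the growing interval and 9 with the fixed one, which illustrates how fewer checkpoints translate into lower total checkpoint overhead.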