Minimizing Overheads of Checkpoints in Distributed Stream Processing Systems

2018 IEEE 7th International Conference on Cloud Networking (CloudNet). Pub Date: 2018-10-01. DOI: 10.1109/CloudNet.2018.8549548

Syed Muhammad Abrar Akber, Hanhua Chen, Yonghui Wang, Hai Jin


Citations: 9

Abstract

Failures in large-scale systems are inevitable, which makes resilience a key challenge for modern systems. Checkpointing with rollback recovery is a well-known approach to providing fault tolerance in distributed systems. Checkpoint-based fault tolerance periodically persists the application state to reliable storage, and the persisted state serves as a recovery point in case of failure. Periodic checkpoints are not aligned with the failure rate of the system, since many studies conclude that failures do not occur periodically. The choice of checkpoint interval is a crucial decision because it directly determines the checkpoint overhead. To minimize this overhead, we propose reducing the number of checkpoints taken during application execution by successively increasing the checkpoint interval. We consider the failure probability of the underlying infrastructure and iteratively increase the interval. The proposed approach tailors checkpoint initiation to the failure probability: if the failure probability is low, it increases the checkpoint interval, which in turn reduces the total number of checkpoints triggered over the application's lifetime. Fewer checkpoints during execution means lower checkpoint overhead. Experimental results show that the proposed checkpoint policy considerably reduces checkpoint overhead compared to periodic checkpoints.
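
The idea in the abstract can be illustrated with a small sketch. The abstract does not state the actual update rule, so everything concrete here is an assumption made for illustration: the function names, the multiplicative growth factor, the probability threshold, and the choice to reset the interval when the failure probability is high are all hypothetical and are not taken from the paper. The sketch only compares how many checkpoints a fixed-interval policy and an interval-growing policy would trigger over the same run.

```python
from typing import Callable, List


def periodic_schedule(run_time: float, interval: float) -> List[float]:
    """Checkpoint times for a fixed (periodic) interval."""
    times = []
    t = interval
    while t <= run_time:
        times.append(t)
        t += interval
    return times


def adaptive_schedule(
    run_time: float,
    base_interval: float,
    failure_prob: Callable[[float], float],
    threshold: float = 0.05,  # assumed cutoff for "low" failure probability
    growth: float = 1.5,      # assumed multiplicative growth of the interval
) -> List[float]:
    """Checkpoint times when the interval grows while failure probability is low.

    Hypothetical rule: after each checkpoint, if failure_prob(t) is below
    the threshold, the next interval is multiplied by `growth`; otherwise
    the interval falls back to `base_interval`.
    """
    times = []
    interval = base_interval
    t = interval
    while t <= run_time:
        times.append(t)
        if failure_prob(t) < threshold:
            interval = interval * growth
        else:
            interval = base_interval
        t += interval
    return times


# Illustrative run: one hour of execution, 60 s base interval,
# and a constant, low failure probability (made-up numbers).
periodic = periodic_schedule(3600, 60)
adaptive = adaptive_schedule(3600, 60, lambda t: 0.01)
print(len(periodic), len(adaptive))
```

If each checkpoint has roughly the same cost, total checkpoint overhead scales with the number of checkpoints, so the shorter schedule corresponds to lower overhead. The trade-off, which this sketch does not model, is that a longer interval means more work is lost and recomputed when a failure does occur.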

