Toward High-Availability Distributed Stream Computing Systems via Checkpoint Adaptation

IF 1.5; Zone 4 (Computer Science); JCR Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

Concurrency and Computation-Practice & Experience; Pub Date: 2025-06-21; DOI: 10.1002/cpe.70171

Dawei Sun, Jia Peng, Ting Zhu, Jonathan Kua, Shang Gao, Rajkumar Buyya

Volume 37, Issues 15-17. Publisher page: https://onlinelibrary.wiley.com/doi/10.1002/cpe.70171

Citations: 0

Abstract

The importance of fault tolerance strategies for distributed streaming computing systems becomes more evident due to the increased diversity of failures. Checkpointing is considered a general and efficient method for ensuring fault tolerance. However, determining the checkpoint interval poses a challenge: shorter checkpoint intervals lead to higher overhead, while longer intervals result in extended fault recovery time. Therefore, optimizing the checkpoint interval becomes crucial for the efficient operation of streaming applications. There has been relatively limited exploration and analysis of optimal checkpoint interval settings in the context of stream computing. Many existing works considered adjusting this interval based on a single factor. This article proposes a checkpoint adaptive strategy with high availability, named Ca-Stream. It considers multiple factors when adjusting checkpoint intervals. Specifically, it addresses the following aspects: (1) Using linear regression to predict the system's fault rate and dynamically adjusting the checkpoint interval based on these predictions. (2) Monitoring CPU time and memory consumption per task to dynamically trigger checkpoints, achieving high reliability, especially in resource-constrained scenarios. (3) Detecting task execution times on nodes and volume of input data for tasks to identify slow tasks within the cluster. Experiments conducted on a Flink system demonstrate Ca-Stream's benefits. It reduces checkpoint consumption time by over 38%, system recovery latency by 33%, CPU occupancy by up to 47%, and memory occupancy by 37% compared to Flink's approaches.
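The abstract does not give Ca-Stream's formulas, so the following is only a minimal sketch of the idea behind aspect (1): fit a line to recent fault-rate observations, extrapolate one window ahead, and shorten the checkpoint interval as the predicted rate rises. The mapping from fault rate to interval used here is Young's classical first-order approximation, interval ≈ sqrt(2 × checkpoint cost × mean time between failures). It is swapped in for illustration and is not claimed to be the paper's rule. All function names, parameters, and the clamping bounds are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All x values identical (or a single sample): fall back to a flat line.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_fault_rate(history, next_window):
    """Extrapolate the fault rate (failures per hour) for `next_window`.

    `history` is a list of (window_index, failures_per_hour) pairs.
    This observation format is an assumption, not taken from the paper.
    """
    xs = [w for w, _ in history]
    ys = [r for _, r in history]
    a, b = fit_line(xs, ys)
    # Clamp to a tiny positive value so the MTBF below stays finite.
    return max(a + b * next_window, 1e-9)


def checkpoint_interval(fault_rate_per_hour, checkpoint_cost_s,
                        min_s=1.0, max_s=3600.0):
    """Young's approximation: sqrt(2 * C * MTBF), clamped to [min_s, max_s]."""
    mtbf_s = 3600.0 / fault_rate_per_hour
    interval = math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
    return min(max(interval, min_s), max_s)
```

For example, with observed rates of 1, 2, and 3 failures per hour over windows 0 to 2, `predict_fault_rate` extrapolates 4 failures per hour for window 3. With a 2-second checkpoint cost, `checkpoint_interval` then yields 60 seconds. A higher predicted rate shortens the interval and a lower one lengthens it, which is the trade-off the abstract describes between checkpoint overhead and recovery time.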


Source journal

Concurrency and Computation-Practice & Experience (Engineering and Technology; Computer Science: Theory and Methods)

CiteScore

5.00

Self-citation rate

10.00%

Articles published

664

Review time

9.6 months

Journal description: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.