Authors: Dawei Sun, Jia Peng, Ting Zhu, Jonathan Kua, Shang Gao, Rajkumar Buyya
Journal: Concurrency and Computation: Practice and Experience, 37 (15-17)
Published: 2025-06-21
DOI: 10.1002/cpe.70171 (https://onlinelibrary.wiley.com/doi/10.1002/cpe.70171)
Toward High-Availability Distributed Stream Computing Systems via Checkpoint Adaptation
The importance of fault tolerance strategies for distributed streaming computing systems becomes more evident due to the increased diversity of failures. Checkpointing is considered a general and efficient method for ensuring fault tolerance. However, determining the checkpoint interval poses a challenge: shorter checkpoint intervals lead to higher overhead, while longer intervals result in extended fault recovery time. Therefore, optimizing the checkpoint interval becomes crucial for the efficient operation of streaming applications. There has been relatively limited exploration and analysis of optimal checkpoint interval settings in the context of stream computing. Many existing works considered adjusting this interval based on a single factor. This article proposes a checkpoint adaptive strategy with high availability, named Ca-Stream. It considers multiple factors when adjusting checkpoint intervals. Specifically, it addresses the following aspects: (1) Using linear regression to predict the system's fault rate and dynamically adjusting the checkpoint interval based on these predictions. (2) Monitoring CPU time and memory consumption per task to dynamically trigger checkpoints, achieving high reliability, especially in resource-constrained scenarios. (3) Detecting task execution times on nodes and volume of input data for tasks to identify slow tasks within the cluster. Experiments conducted on a Flink system demonstrate Ca-Stream's benefits. It reduces checkpoint consumption time by over 38%, system recovery latency by 33%, CPU occupancy by up to 47%, and memory occupancy by 37% compared to Flink's approaches.
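The abstract does not give Ca-Stream's formulas, so the following is only a minimal sketch of the first idea (predicting the fault rate with linear regression and adapting the checkpoint interval to it), not the authors' implementation. It assumes the fault history is available as failures per hour in consecutive, equally sized windows, and it substitutes Young's classic first-order approximation, interval ≈ sqrt(2 × checkpoint cost × mean time between failures), for the interval rule. The function names, the clamping bounds, and the minimum rate are all hypothetical choices made for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_fault_rate(history, min_rate=1e-6):
    """Extrapolate the fault rate (failures per hour) for the next window.

    `history` holds the observed rate in consecutive windows (at least two).
    The result is floored at `min_rate` (an assumed bound) so that a
    downward trend never yields a zero or negative rate.
    """
    xs = list(range(len(history)))
    slope, intercept = fit_line(xs, history)
    return max(slope * len(history) + intercept, min_rate)


def checkpoint_interval(fault_rate_per_hour, checkpoint_cost_s,
                        lo=10.0, hi=3600.0):
    """Checkpoint interval in seconds from Young's approximation.

    This is a stand-in rule, not the one used by Ca-Stream. The result is
    clamped to [lo, hi], which are assumed operational limits.
    """
    mtbf_s = 3600.0 / fault_rate_per_hour  # mean time between failures
    interval = math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
    return min(max(interval, lo), hi)
```

For example, a history of 1, 2, 3, and 4 failures per hour extrapolates to 5 failures per hour; with a checkpoint cost of 2.5 seconds, the sketch yields a 60-second interval. A higher predicted fault rate shortens the interval, which mirrors the trade-off the abstract describes between checkpoint overhead and recovery time.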
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers, and authoritative research review papers, in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.