Xiaohui Gu, S. Papadimitriou, Philip S. Yu, Shu-Ping Chang
Toward Predictive Failure Management for Distributed Stream Processing Systems
2008 The 28th International Conference on Distributed Computing Systems, published 2008-06-17
DOI: 10.1109/ICDCS.2008.34 (https://doi.org/10.1109/ICDCS.2008.34)
Citations: 35
Abstract
Distributed stream processing systems (DSPSs) have many important applications, such as sensor data analysis, network security, and business intelligence. Failure management is essential for DSPSs, which often require highly available system operation. In this paper, we explore a new predictive failure management approach that employs online failure prediction to achieve more efficient failure management than previous reactive or proactive approaches. We employ lightweight stream-based classification methods to perform online failure forecasting. Based on the prediction results, the system can take differentiated failure prevention actions on abnormal components only. Our failure prediction model is tunable: it can achieve a desired tradeoff between failure penalty reduction and prevention cost based on a user-defined reward function. To achieve low-overhead online learning, we propose adaptive data stream sampling schemes that adjust measurement sampling rates based on the states of monitored components, and we maintain a limited amount of historical training data using reservoir sampling. We have implemented an initial prototype of the predictive failure management framework within the IBM System S distributed stream processing system. Experimental results show that our system can achieve more efficient failure management than conventional reactive and proactive approaches, while imposing low overhead on the DSPS.
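The abstract states that historical training data is kept to a limited size using reservoir sampling but does not say which variant is used. As an illustration only, the following is a minimal sketch of the classic uniform reservoir scheme (often called Algorithm R). The class name `Reservoir` and its `capacity`, `seed`, `items`, and `seen` names are hypothetical and are not taken from the paper or from IBM System S.

```python
import random


class Reservoir:
    """Fixed-size uniform random sample of a stream (classic Algorithm R).

    Illustrative sketch only; the paper's actual scheme may differ.
    """

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity          # maximum number of samples kept
        self.items = []                   # the current reservoir
        self.seen = 0                     # number of stream items observed
        self._rng = random.Random(seed)   # private RNG for reproducibility

    def add(self, item):
        """Offer one stream item to the reservoir."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Fill phase: keep every item until the reservoir is full.
            self.items.append(item)
            return
        # Replacement phase: the new item is kept with probability
        # capacity / seen, evicting a uniformly chosen resident.
        j = self._rng.randrange(self.seen)  # uniform in [0, seen)
        if j < self.capacity:
            self.items[j] = item
```

Feeding each incoming measurement to `add` keeps memory bounded by `capacity` while every measurement seen so far has the same chance of being in the training sample, which is the property that makes the approach attractive for low-overhead online learning.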