FP4S:基于片段的并行状态恢复，用于有状态流应用程序

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI:10.1109/IPDPS47924.2020.00116

Pinchao Liu, Hailu Xu, D. D. Silva, Qingyang Wang, Sarker Tanzir Ahmed, Liting Hu

{"title":"FP4S:基于片段的并行状态恢复，用于有状态流应用程序","authors":"Pinchao Liu, Hailu Xu, D. D. Silva, Qingyang Wang, Sarker Tanzir Ahmed, Liting Hu","doi":"10.1109/IPDPS47924.2020.00116","DOIUrl":null,"url":null,"abstract":"Streaming computations are by nature long-running. They run in highly dynamic distributed environments where many stream operators may leave or fail at the same time. Most of them are stateful, in which stream operators need to store and maintain large-sized state in memory, resulting in expensive time and space costs to recover them. The state-of-the-art stream processing systems offer failure recovery mainly through three approaches: replication recovery, checkpointing recovery, and DStream-based lineage recovery, which are either slow, resource-expensive or fail to handle many simultaneous failures.We present FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications. The novelty of FP4S is that we organize all the application’s operators into a distributed hash table (DHT) based consistent ring to associate each operator with a unique set of neighbors. Then we divide each operator’s in-memory state into many fragments and periodically save them in each node’s neighbors, ensuring that different sets of available fragments can reconstruct lost state in parallel. This approach makes this failure recovery mechanism extremely scalable, and allows it to tolerate many simultaneous operator failures. We apply FP4S on Apache Storm and evaluate it using large-scale real-world experiments, which demonstrate its scalability, efficiency, and fast failure recovery features. When compared to the state-of-the-art solutions (Apache Storm), FP4S reduces 37.8% latency of state recovery and saves more than half of the hardware costs. It can scale to many simultaneous failures and successfully recover the states when up to 66.6% of states fail or get lost.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"51 1","pages":"1102-1111"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications\",\"authors\":\"Pinchao Liu, Hailu Xu, D. D. Silva, Qingyang Wang, Sarker Tanzir Ahmed, Liting Hu\",\"doi\":\"10.1109/IPDPS47924.2020.00116\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Streaming computations are by nature long-running. They run in highly dynamic distributed environments where many stream operators may leave or fail at the same time. Most of them are stateful, in which stream operators need to store and maintain large-sized state in memory, resulting in expensive time and space costs to recover them. The state-of-the-art stream processing systems offer failure recovery mainly through three approaches: replication recovery, checkpointing recovery, and DStream-based lineage recovery, which are either slow, resource-expensive or fail to handle many simultaneous failures.We present FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications. The novelty of FP4S is that we organize all the application’s operators into a distributed hash table (DHT) based consistent ring to associate each operator with a unique set of neighbors. Then we divide each operator’s in-memory state into many fragments and periodically save them in each node’s neighbors, ensuring that different sets of available fragments can reconstruct lost state in parallel. This approach makes this failure recovery mechanism extremely scalable, and allows it to tolerate many simultaneous operator failures. We apply FP4S on Apache Storm and evaluate it using large-scale real-world experiments, which demonstrate its scalability, efficiency, and fast failure recovery features. When compared to the state-of-the-art solutions (Apache Storm), FP4S reduces 37.8% latency of state recovery and saves more than half of the hardware costs. It can scale to many simultaneous failures and successfully recover the states when up to 66.6% of states fail or get lost.\",\"PeriodicalId\":6805,\"journal\":{\"name\":\"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"volume\":\"51 1\",\"pages\":\"1102-1111\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS47924.2020.00116\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS47924.2020.00116","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

流计算本质上是长时间运行的。它们运行在高度动态的分布式环境中，许多流操作符可能同时离开或失败。其中大多数是有状态的，流操作符需要在内存中存储和维护大容量的状态，这导致恢复它们需要花费昂贵的时间和空间成本。最先进的流处理系统主要通过三种方法提供故障恢复:复制恢复、检查点恢复和基于dstream的沿袭恢复，这些方法要么速度慢、资源昂贵，要么无法处理许多同时发生的故障。我们提出了一种新的基于碎片的并行状态恢复机制FP4S，它可以处理大量并发运行的流应用程序的许多并发故障。FP4S的新颖之处在于，我们将应用程序的所有操作符组织到一个基于分布式哈希表(DHT)的一致环中，以便将每个操作符与一组唯一的邻居关联起来。然后将每个操作符的内存状态划分为多个片段，周期性地保存在每个节点的邻居节点中，保证不同的可用片段集可以并行地重建丢失的状态。这种方法使故障恢复机制具有极大的可扩展性，并允许它容忍许多同时发生的操作人员故障。我们将FP4S应用于Apache Storm，并使用大规模的真实世界实验对其进行评估，证明了它的可扩展性，效率和快速故障恢复特性。与最先进的解决方案(Apache Storm)相比，FP4S减少了37.8%的状态恢复延迟，节省了一半以上的硬件成本。它可以扩展到许多同时发生的故障，并在高达66.6%的状态失败或丢失时成功恢复状态。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications

Streaming computations are by nature long-running. They run in highly dynamic distributed environments where many stream operators may leave or fail at the same time. Most of them are stateful, in which stream operators need to store and maintain large-sized state in memory, resulting in expensive time and space costs to recover them. The state-of-the-art stream processing systems offer failure recovery mainly through three approaches: replication recovery, checkpointing recovery, and DStream-based lineage recovery, which are either slow, resource-expensive or fail to handle many simultaneous failures.We present FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications. The novelty of FP4S is that we organize all the application’s operators into a distributed hash table (DHT) based consistent ring to associate each operator with a unique set of neighbors. Then we divide each operator’s in-memory state into many fragments and periodically save them in each node’s neighbors, ensuring that different sets of available fragments can reconstruct lost state in parallel. This approach makes this failure recovery mechanism extremely scalable, and allows it to tolerate many simultaneous operator failures. We apply FP4S on Apache Storm and evaluate it using large-scale real-world experiments, which demonstrate its scalability, efficiency, and fast failure recovery features. When compared to the state-of-the-art solutions (Apache Storm), FP4S reduces 37.8% latency of state recovery and saves more than half of the hardware costs. It can scale to many simultaneous failures and successfully recover the states when up to 66.6% of states fail or get lost.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

自引率

0.00%

发文量