Understanding the effects and implications of compute node related failures in Hadoop

Florin Dinu, T. Ng
{"title":"了解hadoop中计算节点相关故障的影响和含义","authors":"Florin Dinu, T. Ng","doi":"10.1145/2287076.2287108","DOIUrl":null,"url":null,"abstract":"Hadoop has become a critical component in today's cloud environment. Ensuring good performance for Hadoop is paramount for the wide-range of applications built on top of it. In this paper we analyze Hadoop's behavior under failures involving compute nodes. We find that even a single failure can result in inflated, variable and unpredictable job running times, all undesirable properties in a distributed system. We systematically track the causes underlying this distressing behavior. First, we find that Hadoop makes unrealistic assumptions about task progress rates. These assumptions can be easily invalidated by the cloud environment and, more surprisingly, by Hadoop's own design decisions. The result are significant inefficiencies in Hadoop's speculative execution algorithm. Second, failures are re-discovered individually by each task at the cost of great degradation in job running time. The reason is that Hadoop focuses on extreme scalability and thus trades off possible improvements resulting from sharing failure information between tasks. Third, Hadoop does not consider the causes of connection failures between its tasks. We show that the resulting overloading of connection failure semantics unnecessarily causes an otherwise localized failure to propagate to healthy tasks. We also discuss the implications of our findings and draw attention to new ways of improving Hadoop-like frameworks.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"23 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"81","resultStr":"{\"title\":\"Understanding the effects and implications of compute node related failures in hadoop\",\"authors\":\"Florin Dinu, T. Ng\",\"doi\":\"10.1145/2287076.2287108\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hadoop has become a critical component in today's cloud environment. Ensuring good performance for Hadoop is paramount for the wide-range of applications built on top of it. In this paper we analyze Hadoop's behavior under failures involving compute nodes. We find that even a single failure can result in inflated, variable and unpredictable job running times, all undesirable properties in a distributed system. We systematically track the causes underlying this distressing behavior. First, we find that Hadoop makes unrealistic assumptions about task progress rates. These assumptions can be easily invalidated by the cloud environment and, more surprisingly, by Hadoop's own design decisions. The result are significant inefficiencies in Hadoop's speculative execution algorithm. Second, failures are re-discovered individually by each task at the cost of great degradation in job running time. The reason is that Hadoop focuses on extreme scalability and thus trades off possible improvements resulting from sharing failure information between tasks. Third, Hadoop does not consider the causes of connection failures between its tasks. We show that the resulting overloading of connection failure semantics unnecessarily causes an otherwise localized failure to propagate to healthy tasks. 
We also discuss the implications of our findings and draw attention to new ways of improving Hadoop-like frameworks.\",\"PeriodicalId\":330072,\"journal\":{\"name\":\"IEEE International Symposium on High-Performance Parallel Distributed Computing\",\"volume\":\"23 4\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"81\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE International Symposium on High-Performance Parallel Distributed Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2287076.2287108\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE International Symposium on High-Performance Parallel Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2287076.2287108","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 81

Abstract

Hadoop has become a critical component in today's cloud environment. Ensuring good performance for Hadoop is paramount for the wide range of applications built on top of it. In this paper we analyze Hadoop's behavior under failures involving compute nodes. We find that even a single failure can result in inflated, variable and unpredictable job running times, all undesirable properties in a distributed system. We systematically track the causes underlying this distressing behavior. First, we find that Hadoop makes unrealistic assumptions about task progress rates. These assumptions can be easily invalidated by the cloud environment and, more surprisingly, by Hadoop's own design decisions. The result is significant inefficiencies in Hadoop's speculative execution algorithm. Second, failures are re-discovered individually by each task at the cost of great degradation in job running time. The reason is that Hadoop focuses on extreme scalability and thus trades off possible improvements resulting from sharing failure information between tasks. Third, Hadoop does not consider the causes of connection failures between its tasks. We show that the resulting overloading of connection failure semantics unnecessarily causes an otherwise localized failure to propagate to healthy tasks. We also discuss the implications of our findings and draw attention to new ways of improving Hadoop-like frameworks.
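
The first finding concerns speculation driven by task progress rates. The following is a minimal sketch, not Hadoop's actual implementation (Hadoop itself is written in Java; the Task fields, pick_speculation_candidates, and the slowness_factor threshold are all hypothetical names chosen for illustration). It encodes the assumption the paper critiques: that peer tasks progress at comparable rates, so any task whose rate falls well below the mean must be a straggler worth duplicating.

```python
# Hypothetical sketch of a progress-rate-based speculation heuristic
# (illustrative only, not Hadoop's code). It assumes peer tasks progress
# at comparable rates, which the paper shows is easily invalidated.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    progress: float      # fraction complete, in [0.0, 1.0]
    runtime_secs: float  # wall-clock seconds since the task started

    @property
    def rate(self) -> float:
        # Average progress per second so far.
        return self.progress / self.runtime_secs if self.runtime_secs > 0 else 0.0


def pick_speculation_candidates(tasks, slowness_factor=0.5):
    """Return running tasks whose progress rate falls below
    slowness_factor times the mean rate of their running peers."""
    running = [t for t in tasks if t.progress < 1.0]
    if not running:
        return []
    mean_rate = sum(t.rate for t in running) / len(running)
    return [t for t in running if t.rate < slowness_factor * mean_rate]


if __name__ == "__main__":
    tasks = [
        Task("r_000000", progress=0.70, runtime_secs=100.0),
        Task("r_000001", progress=0.68, runtime_secs=100.0),
        # Stalled waiting on a failed map's output, not intrinsically slow:
        Task("r_000002", progress=0.10, runtime_secs=100.0),
    ]
    for t in pick_speculation_candidates(tasks):
        print(f"speculate {t.task_id} (rate={t.rate:.4f})")
```

Under a heterogeneous cloud environment, or under Hadoop's own design decisions (for example, a reducer that cannot progress until a failed map's output is regenerated), the mean-rate baseline itself is skewed: the heuristic can speculate on tasks that are blocked rather than slow while missing genuine stragglers, which is the kind of inefficiency the abstract describes.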