When is the Right Time to Start the Fault Tolerance Protection?

2017 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2017-07-01. DOI: 10.1109/HPCS.2017.70

Jorge Villamayor, Dolores Rexachs, E. Luque


Citations: 0

Abstract



In High Performance Computing, Fault Tolerance (FT) has become a primary concern because of the constant growth and continuous aging of hardware components, which raise the probability of failures. Failures degrade the performance of the environment and significantly affect users' expected execution time. Rollback-Recovery protocols are a fundamental component for protecting and restoring users' parallel application executions, although this protection comes with an overhead. This paper proposes a First Protection Point model, which determines the point at which introducing FT protection starts to pay off in terms of total execution time, failures included. Rollback-Recovery protocols applied to parallel applications are characterized to obtain the key factors for the model design. The model can help users determine which checkpoints used for FT protection can be removed from the application execution, reducing overhead while keeping high availability. An analytical evaluation of the model shows the inflection point at which FT protection starts to benefit users. Finally, three experimental environments are set up: two private clusters and a public cluster configured on a well-known cloud, Amazon EC2. A coordinated checkpoint facility is applied to NAS benchmark applications such as CG, BT, and LU to evaluate the proposed model, reducing the overhead impact of the provided Fault Tolerance.
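The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's First Protection Point model, whose equations are not reproduced on this page. It assumes a single failure per run, a fixed cost per checkpoint, a fixed cost to restart from a checkpoint, and a free relaunch from scratch. All function and parameter names are hypothetical. Under these assumptions, a checkpoint taken too early costs more than the re-execution it would save.

```python
def unprotected_time(work, fail_at):
    """Total time with no checkpoints: a failure forces a restart from scratch.

    work    -- seconds of useful computation the application needs
    fail_at -- useful-work position (seconds) at which the single failure
               occurs, or None for a failure-free run
    """
    if fail_at is None or fail_at >= work:
        return work
    return fail_at + work


def protected_time(work, fail_at, ckpt_positions, ckpt_cost, restart_cost):
    """Total time with checkpoints taken at the given useful-work positions.

    Assumes (for illustration only) that every checkpoint costs ckpt_cost
    seconds and that restarting from a checkpoint costs restart_cost seconds.
    """
    positions = sorted(p for p in ckpt_positions if 0 < p < work)
    overhead = ckpt_cost * len(positions)  # each position is checkpointed once
    if fail_at is None or fail_at >= work:
        return work + overhead
    completed = [p for p in positions if p <= fail_at]
    if not completed:
        # Failure before the first checkpoint: restart from scratch anyway.
        return fail_at + work + overhead
    last = completed[-1]
    # Only the work done since the last checkpoint has to be repeated.
    return fail_at + restart_cost + (work - last) + overhead


def break_even_position(ckpt_cost, restart_cost):
    """Earliest position at which one checkpoint stops being a net loss,
    given that a failure occurs after it (toy model, single checkpoint)."""
    return ckpt_cost + restart_cost


if __name__ == "__main__":
    work, fail_at, ckpt_cost, restart_cost = 1000, 900, 20, 30
    print(unprotected_time(work, fail_at))                               # 1900
    print(protected_time(work, fail_at, [40], ckpt_cost, restart_cost))  # 1910
    print(protected_time(work, fail_at, [500], ckpt_cost, restart_cost)) # 1450
    print(break_even_position(ckpt_cost, restart_cost))                  # 50
```

In this toy setting a checkpoint at position 40 makes the failed run slower than the unprotected one, while a checkpoint at position 500 saves 450 seconds. The break-even value of 50 seconds is an artifact of the simplifying assumptions and should not be read as a result from the paper.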

