Bruno Coutinho, Dorgival Olavo Guedes Neto, Wagner Meira Jr, R. Ferreira
19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07), 2007-11-19. DOI: 10.1109/SBAC-PAD.2007.31 (https://doi.org/10.1109/SBAC-PAD.2007.31)
Fault-tolerance in filter-labeled-stream applications
Fault tolerance is a desirable feature in distributed high-performance systems, since applications tend to run for long periods of time and faults become more likely as the number of nodes in the system increases. However, most distributed environments lack fault-tolerance features, since these tend to be hard to implement and use and often hurt performance dramatically. In this paper we discuss how we added fault tolerance to the Anthill distributed programming environment using an application-level checkpoint/rollback solution. The programming model offers an abstraction in which the programmer can easily identify points during execution where the communication pattern is well defined. These points form a consistent cut at which checkpoints can be saved without extra communication, which avoids any domino effect during recovery from faults. We present the new abstractions for fault tolerance, describe how the solution was implemented, and present performance results that show the efficiency of the solution with both regular and irregular applications.
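To make the idea of checkpointing at a consistent cut concrete, the following is a minimal sketch in Python. It is not Anthill's API and not the authors' implementation: the `Worker` class, the `run` function, and the `fail_at` parameter are hypothetical names introduced only for illustration, and checkpoints are held in memory rather than on stable storage. It assumes an iterative application in which no messages are in flight at an iteration boundary, so every worker can save its state there without coordinating with the others.

```python
import copy


class Worker:
    """Hypothetical stand-in for one filter instance (not Anthill's API)."""

    def __init__(self, values):
        self.state = {"iteration": 0, "values": list(values)}
        self.saved = None  # last checkpoint; kept in memory for brevity

    def step(self):
        # One unit of work between two consistent cuts.
        self.state["values"] = [v + 1 for v in self.state["values"]]
        self.state["iteration"] += 1

    def checkpoint(self):
        # Called at an iteration boundary, assumed to be a point where
        # no messages are in flight between workers.
        self.saved = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the state saved at the last consistent cut.
        self.state = copy.deepcopy(self.saved)


def run(workers, iterations, fail_at=None):
    """Run all workers, injecting one simulated fault in iteration `fail_at`."""
    for w in workers:
        w.checkpoint()
    i = 0
    failed_once = False
    while i < iterations:
        try:
            for idx, w in enumerate(workers):
                if i == fail_at and not failed_once and idx == 1:
                    failed_once = True
                    raise RuntimeError("simulated node fault")
                w.step()
        except RuntimeError:
            # Every worker returns to the same cut, so recovery never
            # cascades to earlier checkpoints (no domino effect).
            for w in workers:
                w.rollback()
            i = workers[0].state["iteration"]
            continue
        for w in workers:
            w.checkpoint()
        i += 1
    return [w.state["values"] for w in workers]
```

In this sketch, a run with an injected fault produces the same final values as a fault-free run, because the partially completed iteration is discarded and repeated from the last cut. A real system would also need to write checkpoints to storage that survives the fault and to detect failures, both of which are omitted here.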