A hybrid fault tolerance scheme for EasyGrid MPI applications
Authors: J. A. D. Silva, Vinod E. F. Rebello
Published in: Middleware for Grid Computing, 2011-12-12
DOI: https://doi.org/10.1145/2089002.2089006
Writing applications that execute efficiently in distributed systems is extremely difficult and tedious for inexperienced users. The resources may be heterogeneous, non-dedicated, and offered without any performance or availability guarantees. Systems capable of adapting the execution of an application to these characteristics are essential. The EasyGrid Application Management System (AMS) transforms cluster-based MPI applications into autonomic ones capable of executing robustly and efficiently in distributed environments. This work describes a strategy to endow these autonomic MPI applications with the property of self-healing, making them capable of withstanding multiple simultaneous crash faults of processes and/or processors. The extremely low intrusion cost of the proposed hybrid solution might now facilitate acceptance of fault tolerance techniques in large-scale high-performance applications.