Amin Hassani, A. Skjellum, P. Bangalore, R. Brightwell
Workshop on Exascale MPI, published 2015-11-15. DOI: 10.1145/2831129.2831130 (https://doi.org/10.1145/2831129.2831130)
Practical resilient cases for FA-MPI, a transactional fault-tolerant MPI
MPI is insufficient when confronting failures. FA-MPI (Fault-Aware MPI) provides extensions to the MPI standard designed to enable data-parallel applications to achieve resilience without sacrificing scalability. FA-MPI introduces transactions as a novel extension to the MPI message-passing model. Transactions support failure detection, isolation, mitigation, and recovery via application-driven policies. Overlapping communication and I/O with computation through non-blocking operations is increasingly important for achieving the maximum performance of modern machines. Therefore, we emphasize fault-tolerant, non-blocking communication operations plus a set of nestable, lightweight transactional TryBlock API extensions able to exploit system and application hierarchy. This strategy enables applications to run to completion with higher probability than they otherwise would. We modified two proxy applications, MiniFE and LULESH, by adding FA-MPI semantics to them. Finally, we present performance and overhead results for 1K MPI processes.
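The nested-transaction idea described in the abstract can be illustrated with a toy model. The sketch below is not FA-MPI's API and makes no MPI calls: the names `TryBlock`, `run_transaction`, `make_flaky`, and `OperationFailed` are hypothetical, faults are simulated, and the "retry locally, then escalate to the enclosing block" policy is only one assumed example of an application-driven policy. It shows operations being posted without running (standing in for non-blocking calls), checked for failure when the block finishes, and recovered at the innermost level that can handle them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class OperationFailed(Exception):
    """Simulated fault reported by an operation or escalated by a nested block."""


@dataclass
class TryBlock:
    """Hypothetical transactional scope (not the FA-MPI TryBlock API)."""
    name: str
    pending: List[Callable[[], None]] = field(default_factory=list)

    def post(self, op: Callable[[], None]) -> None:
        # Register work without running it, standing in for a non-blocking call.
        self.pending.append(op)

    def finish(self) -> bool:
        # Complete every pending operation; report whether all of them succeeded.
        ops, self.pending = self.pending, []
        ok = True
        for op in ops:
            try:
                op()
            except OperationFailed:
                ok = False
        return ok


def run_transaction(name, body, max_retries):
    """Assumed policy: retry the block locally, then escalate to the caller.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        block = TryBlock(name)
        try:
            body(block)
            if block.finish():
                return attempt
        except OperationFailed:
            pass  # a nested block gave up; count it as a failure of this block
    raise OperationFailed(f"{name} failed after {max_retries} retries")


def make_flaky(failures, log):
    """Build an operation that fails `failures` times and then succeeds."""
    remaining = [failures]

    def op():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OperationFailed("simulated fault")
        log.append("done")

    return op


log = []
flaky = make_flaky(3, log)


def outer_body(block):
    # The inner block may retry once by itself before escalating to the outer block.
    run_transaction("inner", lambda b: b.post(flaky), max_retries=1)


outer_retries = run_transaction("outer", outer_body, max_retries=1)
print(outer_retries, log)
```

In this run the inner block exhausts its single retry and escalates, the outer block retries once, and the operation then completes. A fault that the inner block could absorb within its own retry budget would never reach the outer block, which is the sense in which nesting can follow the application's hierarchy.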