Authors: Omer Subasi, R. Tipireddy, S. Krishnamoorthy
Published in: 2018 IEEE 25th International Conference on High Performance Computing (HiPC), 2018-12-01
DOI: 10.1109/HiPC.2018.00029 (https://doi.org/10.1109/HiPC.2018.00029)
Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability
Checkpointing is the most widely used technique in high-performance computing (HPC) for ensuring application progress in the presence of failures. In this paper, we present mathematical models of checkpointing systems to quantify their reliability and availability. We perform a trade-off analysis with respect to resource costs and reliability. We then explore optimal checkpoint placement to maximize system availability. Finally, we rigorously compare the behavior of redundant systems in which replication and repair mechanisms are employed. We postulate that the proposed models can aid system designers, who can instantiate them to assess and quantify the availability and reliability of systems of interest.
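To make the placement question concrete, the sketch below uses Young's classical first-order approximation for a periodic checkpoint interval. It is not the model proposed in this paper, whose equations do not appear in the abstract. The function names, the parameter values, and the simplified waste model (restart time ignored, failures assumed rare relative to the interval) are all assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M).

    checkpoint_cost (C) and mtbf (M) must use the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost (assumed simplified model).

    C / interval      : overhead of writing checkpoints
    interval / (2 M)  : expected rework after a failure
    Restart time is ignored in this sketch.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def approx_availability(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Fraction of time spent on useful work under the same assumptions."""
    return 1.0 - first_order_waste(interval, checkpoint_cost, mtbf)


# Hypothetical values: 60 s per checkpoint, one failure per day on average.
C = 60.0
M = 24 * 3600.0
tau = young_interval(C, M)

print(f"interval: {tau:.1f} s")
print(f"approximate availability: {approx_availability(tau, C, M):.4f}")
```

At the interval returned by `young_interval`, the checkpoint overhead and the expected rework are equal. Checkpointing more often raises the first term and checkpointing less often raises the second, which is the kind of trade-off the abstract refers to.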